When `git annex export treeish --to remote` is used to export to a remote,
and the remote allows files to somehow be edited on it, then there ought
to be a way to import the changes back from the remote into the git repository.

The command could be `git annex import --from remote`. It would find
changed/new/deleted files on the remote, download the changed/new files
and inject them into the annex, and generate a new treeish, with parent
the treeish that was exported, that has the modifications in it.

Updating the working copy is then done by merging the import treeish.
This way, conflicts will be detected and handled as normal by git.

## content identifiers
The remote is responsible for collecting a list of
files currently in it, along with some content identifier. That data is
sent to git-annex. git-annex keeps track of which content identifier(s) map
to which keys, and uses the information to determine when a file on the
remote has changed or is new.

git-annex can simply build git tree objects as the file list
comes in, looking up the key corresponding to each content identifier
(or downloading the content from the remote and adding it to the annex
when there's no corresponding key yet). It might be possible to avoid
git-annex buffering much tree data in memory.
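
A minimal sketch of that lookup, assuming a simple in-memory map from
content identifiers to keys (the real mapping would be persisted;
`ContentIdentifier`, `Key`, and the download action here are hypothetical
stand-ins, not git-annex's real types):

    import qualified Data.Map.Strict as M

    -- Hypothetical stand-ins for the real types.
    newtype ContentIdentifier = ContentIdentifier String deriving (Eq, Ord, Show)
    newtype Key = Key String deriving (Show)
    type ExportLocation = FilePath

    -- Resolve one listed file to a key, either from the known mapping or by
    -- downloading the content and adding it to the annex.
    resolveKey
        :: (ExportLocation -> ContentIdentifier -> IO Key) -- download and generate a key
        -> M.Map ContentIdentifier Key
        -> (ExportLocation, ContentIdentifier)
        -> IO (Key, M.Map ContentIdentifier Key)
    resolveKey download known (loc, cid) =
        case M.lookup cid known of
            Just k  -> return (k, known)
            Nothing -> do
                k <- download loc cid
                return (k, M.insert cid k known)
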
----
A good content identifier needs to:

* Be stable, so when a file has not changed, the content identifier
  remains the same.
* Change when a file is modified.
* Be as unique as possible, but not necessarily fully unique.

A hash of the content would be ideal.
A (size, mtime, inode) tuple is as good a content identifier as git uses in
its index.

git-annex will need a way to get the content identifiers of files
that it stores on the remote when exporting a tree to it, so it can later
know if those files have changed.
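
A sketch of the stat-based flavour, using the unix package (the `StatCid`
type is a hypothetical stand-in for git-annex's content identifier type):

    import System.Posix.Files (getFileStatus, fileSize, modificationTime, fileID)
    import System.Posix.Types (FileOffset, EpochTime, FileID)

    -- A content identifier in the style of git's index: not a content hash,
    -- but cheap to get and good enough to detect changes.
    data StatCid = StatCid
        { cidSize  :: FileOffset
        , cidMtime :: EpochTime
        , cidInode :: FileID
        } deriving (Eq, Show)

    statContentIdentifier :: FilePath -> IO StatCid
    statContentIdentifier f = do
        s <- getFileStatus f
        return (StatCid (fileSize s) (modificationTime s) (fileID s))
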
----
The content identifier needs to be stored somehow for later use.
It would be good to store the content identifiers only locally, if
possible.

Would local storage pose a problem when multiple repositories import from
the same remote? In that case, perhaps different trees would be imported,
and merged into master. So the two repositories then have differing
masters, which can be reconciled in merge as usual.

Since exporttree remotes don't have content identifier information yet, it
needs to be collected the first time import tree is used. (Or import
everything, but that is probably too expensive). Any modifications made to
exported files before the first import tree would not be noticed. Seems
acceptable as long as this only affects exporttree remotes created before
this feature was added.
What if repo A is being used to import tree from R for a while, and the
user gets used to editing files on R and importing them. Then they stop
using A and switch to clone B. It would not have the content identifier
information that A did. It seems that in this case, B needs to re-download
everything, to build up the map of content identifiers.
(Anything could have changed since the last time A imported).
That seems too expensive!

Would storing content identifiers in the git-annex branch be too
expensive? Probably not.. For S3 with versioning a content identifier is
already stored. When the content identifier is (mtime, size, inode),
that's a small amount of data. The maximum size of a content identifier
could be limited to the size of a typical hash, and if a remote for some
reason gets something larger, it could simply hash it to generate
the content identifier.
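
A sketch of that size cap, assuming the cryptonite library is available
(the 32-byte limit and the helper name are illustrative, not a decided
format):

    import Crypto.Hash (hashWith, SHA256(..))
    import qualified Data.ByteArray as BA
    import qualified Data.ByteString as B

    -- Cap a remote-supplied content identifier at the size of a SHA256
    -- digest; anything larger is replaced by its hash, which is still
    -- stable and unique enough for change detection.
    capContentIdentifier :: B.ByteString -> B.ByteString
    capContentIdentifier c
        | B.length c <= 32 = c
        | otherwise = BA.convert (hashWith SHA256 c)
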
## race conditions TODO
A file could be modified on the remote while
it's being exported, and if the remote then uses the mtime of the modified
file in the content identifier, the modification would never be noticed by
imports.

To fix this race, we need an atomic move operation on the remote. Upload
the file to a temp file, then get its content identifier, and then move it
from the temp file to its final location. Alternatively, upload a file and
get the content identifier atomically, which eg S3 with versioning enabled
provides. It would make sense to have the storeExport operation always return
a content identifier and document that it needs to get it atomically by
either using a temp file or something specific to the remote.
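
A sketch of the temp file approach for a directory-backed remote (the names
are hypothetical, not the real ExportActions API):

    import System.Directory (copyFile, renameFile)
    import System.FilePath (takeDirectory, (</>))

    -- Store a file on the remote and return its content identifier.
    -- The identifier is read from the temp file, which nothing else is
    -- writing to, so it describes exactly the content that was uploaded;
    -- the rename then preserves inode, size and mtime.
    storeWithCid
        :: (FilePath -> IO cid) -- get a content identifier, eg by stat
        -> FilePath             -- local content to upload
        -> FilePath             -- final location on the remote
        -> IO cid
    storeWithCid getCid src dest = do
        let tmp = takeDirectory dest </> ".tmp.upload"
        copyFile src tmp
        cid <- getCid tmp
        renameFile tmp dest
        return cid
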
----
There's also a race where a file gets changed on the remote after an
import tree, and an export then overwrites it with something else.

One solution would be to only allow one of importtree or exporttree
to a given remote. This reduces the use cases a lot though, and perhaps
so far that the import tree feature is not worth building. The adb
special remote needs both. Also, such a limitation seems like one that
users might try to work around by initializing two remotes using the same
data and trying to use one for import and the other for export.

Really fixing this race needs locking or an atomic operation. Locking seems
unlikely to be a portable enough solution.
An atomic rename operation could at least narrow the race significantly, eg:

1. get content identifier of $file, check if it's what was expected else
   abort (optional but would catch most problems)
2. upload new version of $file to $tmp1
3. rename current $file to $tmp2
4. Get content identifier of $tmp2, check if it's what was expected to
   be. If not, $file was modified after the last import tree, and that
   conflict has to be resolved. Otherwise, delete $tmp2
5. rename $tmp1 to $file
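
A sketch of those steps for a remote with rename, stat, and delete
operations (the helper and temp file names are hypothetical):

    import System.Directory (copyFile, removeFile, renameFile)
    import System.FilePath (takeDirectory, (</>))

    -- Steps 1-5 above. Returns Nothing when the file on the remote turned
    -- out not to match what the last import saw; that is a conflict the
    -- caller has to resolve.
    replaceChecked
        :: Eq cid
        => (FilePath -> IO cid) -- get a content identifier, eg by stat
        -> cid                  -- identifier expected from the last import
        -> FilePath             -- new content to upload
        -> FilePath             -- file on the remote to replace
        -> IO (Maybe cid)
    replaceChecked getCid expected src dest = do
        let tmp1 = takeDirectory dest </> ".tmp.new"
            tmp2 = takeDirectory dest </> ".tmp.old"
        precheck <- getCid dest                  -- step 1
        if precheck /= expected
            then return Nothing
            else do
                copyFile src tmp1                -- step 2
                renameFile dest tmp2             -- step 3
                oldcid <- getCid tmp2            -- step 4
                if oldcid /= expected
                    then do
                        -- modified after the last import; put it back
                        -- and report the conflict
                        renameFile tmp2 dest
                        removeFile tmp1
                        return Nothing
                    else do
                        removeFile tmp2
                        newcid <- getCid tmp1
                        renameFile tmp1 dest     -- step 5
                        return (Just newcid)
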
That leaves a race if the file gets overwritten after it's moved out
of the way. If the rename refuses to overwrite existing files, that race
would be detected by it failing. renameat(2) with `RENAME_NOREPLACE` can do that,
but probably many special remote interfaces don't provide a way to do that.

S3 lacks a rename operation, can only copy and then delete. Which is not
good enough; it risks the file being replaced with new content before
the delete and the new content being deleted.

Is this race really a significant problem? One way to look at it is
analogous to a git merge overwriting a locally modified file.
Git can certainly use similar techniques to entirely detect and recover
from such races (but not the similar race described in the next section).

But, git does not actually do that! I modified git's
merge.c to sleep for 10 seconds after `refresh_index()`, and verified
that changes made to the work tree in that window were silently overwritten
by git merge. In git's case, the race window is normally quite narrow
and this is very unlikely to happen (the similar race described in the next
section is more likely).
If git-annex could get the race window similarly small it would perhaps be
ok. Eg:

1. upload new version of $file to $tmp
2. get content identifier of $file, check if it's what was expected else
   abort
3. rename (or copy and delete) $tmp to $file

The race window between #2 and #3 could be quite narrow for some remotes.
But S3, lacking a rename, does a copy that can be very slow for large files.
S3, with versioning, could detect the race after the fact, by listing
the versions of the file, and checking if any of the versions is one
that git-annex did not know the file already had.

[Using this api](https://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGETVersion.html),
with version-id-marker set to the previous version of the file,
should list only the previous and current versions; if there's an
intermediate version then the race occurred and it could roll the change
back, or otherwise recover the overwritten version. This could be done at
import time, to detect a previous race, and recover from it; importing
a tree with the file(s) that were overwritten due to the race, leading to a
tree import conflict that the user can resolve. This likely generalizes
to importing a sequence of trees, so each version written to S3 gets
imported.
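
A sketch of that check, with the actual version listing left abstract
(`listVersionsAfter` is hypothetical; for S3 it would be the GET described
above, with version-id-marker set to the previously known version):

    type VersionId = String

    -- After an export, list the versions of the file newer than the one
    -- git-annex knew about before the export. Only the version the export
    -- just created should come back; anything else was written by a
    -- concurrent writer and needs to be recovered by a later import.
    detectOverwriteRace
        :: (VersionId -> IO [VersionId]) -- hypothetical version listing
        -> VersionId                     -- version known before the export
        -> VersionId                     -- version created by the export
        -> IO [VersionId]                -- versions a race overwrote
    detectOverwriteRace listVersionsAfter known new = do
        vs <- listVersionsAfter known
        return (filter (/= new) vs)
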
----
A remaining race is that, if the file is open for write at the same
time it's renamed, the write might happen after the content identifier
is checked, and then whatever is written to it will be lost.

But: Git worktree update has the same race condition. Verified with
this perl oneliner, run in a worktree and a second later
followed by a git pull. The lines that it appended to the
file got lost:

    perl -e 'open (OUT, ">>foo") || die "$!"; sleep(10); while (<>) { print OUT $_ }'

Since this is acceptable in git, I suppose we can accept it here too..
----
If multiple repos can access the remote at the same time, then there's a
potential problem when one is exporting a new tree, and the other one is
importing from the remote.

> This can be reduced to the same problem as exports of two
> different trees to the same remote, which is already handled with the
> export log.
>
> Once a tree has been imported from the remote, it's
> in the same state as exporting that same tree to the remote, so
> update the export log to say that the remote has that treeish exported
> to it. A conflict between two export log entries will be handled as
> usual, with the user being prompted to re-export the tree they want
> to be on the remote. (May need to reword that prompt.)
> --[[Joey]]

## api design
Pulling all of the above together, this is an extension to the
ExportActions api.

    listContents :: Annex (Tree [(ExportLocation, ContentIdentifier)])

    getContentIdentifier :: ExportLocation -> Annex (Maybe ContentIdentifier)

    retrieveExportWithContentIdentifier :: ExportLocation -> ContentIdentifier -> (FilePath -> Annex Key) -> MeterUpdate -> Annex (Maybe Key)

    storeExportWithContentIdentifier :: FilePath -> Key -> ExportLocation -> MeterUpdate -> Annex (Maybe ContentIdentifier)

listContents finds the current set of files that are stored in the remote,
some of which may have been written by other programs than git-annex,
along with their content identifiers. It returns a list of those, often in
a single node tree.

listContents may also find past versions of files that are stored in the
remote, when it supports storing multiple versions of files. Since it
returns a tree of lists of files, it can represent anything from a linear
history to a full branching version control history.
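
A sketch of the single node tree case, for a directory-backed remote, using
simplified stand-ins for ExportLocation and ContentIdentifier:

    import System.Directory (doesDirectoryExist, listDirectory)
    import System.FilePath ((</>))
    import System.Posix.Files (getFileStatus, fileSize, modificationTime)

    type ExportLocation = FilePath
    type ContentIdentifier = String

    -- List every file under the top directory, paired with a cheap
    -- stat-based content identifier.
    listContentsDir :: FilePath -> IO [(ExportLocation, ContentIdentifier)]
    listContentsDir top = go ""
      where
        go sub = do
            entries <- listDirectory (if null sub then top else top </> sub)
            fmap concat (mapM (visit sub) entries)
        visit sub entry = do
            let rel = if null sub then entry else sub </> entry
            isdir <- doesDirectoryExist (top </> rel)
            if isdir
                then go rel
                else do
                    s <- getFileStatus (top </> rel)
                    let cid = show (fileSize s) ++ ":" ++ show (modificationTime s)
                    return [(rel, cid)]
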
retrieveExportWithContentIdentifier is used when downloading a new file from
the remote that listContents found. retrieveExport can't be used because
it has a Key parameter and the key is not yet known in this case.
(The callback generating a key will let eg S3 record the S3 version id for
the key.)

retrieveExportWithContentIdentifier should detect when the file it's
downloaded may not match the requested content identifier (eg when
something else wrote to it), and fail in that case.
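
A sketch of that detection for a directory-backed remote, checking the
content identifier both before and after the download (names are
hypothetical):

    import System.Directory (copyFile)

    -- Retrieve a file, but only accept the download if the content
    -- identifier is the same before and after the copy, so a concurrent
    -- modification is detected rather than silently yielding content that
    -- does not match the requested identifier.
    retrieveChecked
        :: Eq cid
        => (FilePath -> IO cid) -- get a content identifier, eg by stat
        -> cid                  -- the identifier the caller asked for
        -> FilePath             -- location on the remote
        -> FilePath             -- local destination
        -> IO Bool              -- True if the download matched
    retrieveChecked getCid want src dest = do
        before <- getCid src
        if before /= want
            then return False
            else do
                copyFile src dest
                after <- getCid src
                return (after == want)
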
storeExportWithContentIdentifier stores content and gets the content
identifier corresponding to what it stored. It can either get the content
identifier in reply to the store (as S3 does with versioning), or it can
store to a temp location, get the content identifier of that, and then
rename the content into place.

storeExportWithContentIdentifier must avoid overwriting any file that may
have been written to the remote by something else (unless that version of
the file can later be recovered by listContents), so it will typically
need to query for the content identifier before moving the new content
into place.

storeExportWithContentIdentifier needs to handle the case when there's a
race with a concurrent writer. It needs to avoid getting the wrong
ContentIdentifier for data written by the other writer. It may detect such
races and fail, or it could succeed and overwrite the other file, so long
as it can later be recovered by listContents.
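
A sketch of those rules for a directory-backed remote, combining the temp
file trick from above with a check of the existing file's content
identifier against the ones git-annex already knows for that location
(names and the list of known identifiers are hypothetical):

    import System.Directory (copyFile, renameFile)
    import System.FilePath (takeDirectory, (</>))

    -- Store without overwriting content that git-annex does not know about.
    -- Returns Nothing when something unknown is already at the location;
    -- a concurrent writer can still slip in between the check and the
    -- rename, which is the race discussed above.
    storeChecked
        :: Eq cid
        => (FilePath -> IO (Maybe cid)) -- current identifier, Nothing if absent
        -> (FilePath -> IO cid)         -- identifier of a file we control
        -> [cid]                        -- identifiers known for this location
        -> FilePath                     -- local content to upload
        -> FilePath                     -- final location on the remote
        -> IO (Maybe cid)
    storeChecked getCurrentCid getCid knowncids src dest = do
        current <- getCurrentCid dest
        case current of
            Just c | c `notElem` knowncids -> return Nothing
            _ -> do
                let tmp = takeDirectory dest </> ".tmp.store"
                copyFile src tmp
                newcid <- getCid tmp
                renameFile tmp dest
                return (Just newcid)
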
----
See also, [[adb_special_remote]]