When `git annex export treeish --to remote` is used to export to a remote,
and the remote allows files to somehow be edited on it, then there ought
to be a way to import the changes back from the remote into the git
repository. The command could be `git annex import --from remote`.

It would find changed/new/deleted files on the remote, download the
changed/new files and inject them into the annex, and generate a new
treeish, with the previously exported treeish as its parent, that has the
modifications in it.

Updating the working copy is then done by merging the import treeish.
This way, conflicts will be detected and handled as normal by git.

## content identifiers

The remote is responsible for collecting a list of files currently in it,
along with some content identifier. That data is sent to git-annex.
git-annex keeps track of which content identifier(s) map to which keys,
and uses the information to determine when a file on the remote has
changed or is new.

git-annex can simply build git tree objects as the file list comes in,
looking up the key corresponding to each content identifier (or
downloading the content from the remote and adding it to the annex when
there's no corresponding key yet). It might be possible to avoid
git-annex buffering much tree data in memory.
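
As a rough illustration of that lookup, here is a minimal sketch of the
per-file decision. This is not git-annex's real code; both parameters are
hypothetical stand-ins for the content identifier map and for downloading
and injecting into the annex:

    -- Use the key recorded for this content identifier if there is
    -- one; otherwise download the content and inject it into the
    -- annex, producing a new key.
    importFile
        :: (cid -> IO (Maybe key))   -- look up cid in the map
        -> (loc -> cid -> IO key)    -- download and inject
        -> loc -> cid -> IO key
    importFile lookupCid download loc cid =
        maybe (download loc cid) return =<< lookupCid cid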

----

A good content identifier needs to:

* Be stable, so when a file has not changed, the content identifier
  remains the same.
* Change when a file is modified.
* Be as unique as possible, but not necessarily fully unique.

A hash of the content would be ideal. A (size, mtime, inode) tuple is as
good a content identifier as git uses in its index.
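
For concreteness, here is a minimal sketch of such a stat-based
identifier, using the unix package. (This is an illustration, not
git-annex's actual ContentIdentifier type.)

    import System.Posix.Files (getFileStatus, fileSize, modificationTimeHiRes, fileID)
    import System.Posix.Types (FileOffset, FileID)
    import Data.Time.Clock.POSIX (POSIXTime)

    -- (size, mtime, inode), comparable to what git tracks in its
    -- index to decide whether a file is modified.
    data StatIdentifier = StatIdentifier FileOffset POSIXTime FileID
        deriving (Eq, Show)

    mkStatIdentifier :: FilePath -> IO StatIdentifier
    mkStatIdentifier f = do
        st <- getFileStatus f
        return $ StatIdentifier (fileSize st) (modificationTimeHiRes st) (fileID st)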

git-annex will need a way to get the content identifiers of files
that it stores on the remote when exporting a tree to it, so it can later
know if those files have changed.

----

The content identifier needs to be stored somehow for later use.

It would be good to store the content identifiers only locally, if
possible.

Would local storage pose a problem when multiple repositories import from
the same remote? In that case, perhaps different trees would be imported,
and merged into master. So the two repositories then have differing
masters, which can be reconciled in merge as usual.

Since exporttree remotes don't have content identifier information yet, it
needs to be collected the first time import tree is used. (Or import
everything, but that is probably too expensive.) Any modifications made to
exported files before the first import tree would not be noticed. Seems
acceptable as long as this only affects exporttree remotes created before
this feature was added.

What if repo A is being used to import tree from R for a while, and the
user gets used to editing files on R and importing them. Then they stop
using A and switch to clone B. It would not have the content identifier
information that A did. It seems that in this case, B needs to re-download
everything, to build up the map of content identifiers.
(Anything could have changed since the last time A imported.)
That seems too expensive!

Would storing content identifiers in the git-annex branch be too
expensive? Probably not. For S3 with versioning, a content identifier is
already stored. When the content identifier is a (size, mtime, inode)
tuple, that's a small amount of data. The maximum size of a content
identifier could be limited to the size of a typical hash, and if a remote
for some reason gets something larger, it could simply hash it to generate
the content identifier.
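
A sketch of that size cap, assuming the cryptonite package for hashing
(the 32 byte limit matches the size of a SHA-256 digest):

    import Crypto.Hash (hashWith, SHA256 (..))
    import qualified Data.ByteString as B
    import qualified Data.ByteArray as BA

    -- Store small identifiers as-is; hash anything larger, so the
    -- stored identifier never exceeds the size of a typical hash.
    capIdentifier :: B.ByteString -> B.ByteString
    capIdentifier cid
        | B.length cid <= 32 = cid
        | otherwise = BA.convert (hashWith SHA256 cid)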

## race conditions TODO

A file could be modified on the remote while it's being exported, and if
the remote then uses the mtime of the modified file in the content
identifier, the modification would never be noticed by imports.

To fix this race, we need an atomic move operation on the remote. Upload
the file to a temp file, then get its content identifier, and then move it
from the temp file to its final location. Alternatively, upload a file and
get the content identifier atomically, which eg S3 with versioning enabled
provides. It would make sense to have the storeExport operation always
return a content identifier and document that it needs to get it
atomically by either using a temp file or something specific to the
remote.
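
Here is a sketch of the temp file approach for a directory remote,
reusing mkStatIdentifier from the earlier sketch. It relies on rename
keeping the same inode, size, and mtime, so the identifier taken from the
temp file remains valid; the ".tmp" naming is a simplification:

    import System.Directory (copyFile, renameFile)

    -- Upload to a temp name, take the content identifier, then
    -- atomically (on POSIX) move it into place.
    storeWithIdentifier :: FilePath -> FilePath -> IO StatIdentifier
    storeWithIdentifier src dest = do
        let tmp = dest ++ ".tmp"
        copyFile src tmp
        cid <- mkStatIdentifier tmp
        renameFile tmp dest
        return cid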

----

There's also a race where a file gets changed on the remote after an
import tree, and an export then overwrites it with something else.

One solution would be to only allow one of importtree or exporttree
to a given remote. This reduces the use cases a lot though, and perhaps
so far that the import tree feature is not worth building. The adb
special remote needs both. Also, such a limitation seems like one that
users might try to work around by initializing two remotes using the same
data and trying to use one for import and the other for export.

Really fixing this race needs locking or an atomic operation. Locking seems
unlikely to be a portable enough solution.

An atomic rename operation could at least narrow the race significantly,
eg (sketched in code below):

1. get content identifier of $file, check if it's what was expected else
   abort (optional but would catch most problems)
2. upload new version of $file to $tmp1
3. rename current $file to $tmp2
4. get content identifier of $tmp2, check if it's what was expected to
   be. If not, $file was modified after the last import tree, and that
   conflict has to be resolved. Otherwise, delete $tmp2
5. rename $tmp1 to $file
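
A sketch of that sequence for a directory remote, again reusing
mkStatIdentifier; step 1 is omitted, and renameFile will overwrite, so
this narrows the race rather than closing it:

    import System.Directory (copyFile, renameFile, removeFile)

    -- Returns False when $file was modified after the last import
    -- tree; the remote is left as it was so the conflict can be
    -- resolved.
    updateChecked :: FilePath -> FilePath -> StatIdentifier -> IO Bool
    updateChecked src file expected = do
        let tmp1 = file ++ ".new"
        let tmp2 = file ++ ".old"
        copyFile src tmp1              -- step 2: upload new version
        renameFile file tmp2           -- step 3: move current aside
        cid <- mkStatIdentifier tmp2   -- step 4: check what was there
        if cid /= expected
            then do
                renameFile tmp2 file   -- conflict: roll back
                removeFile tmp1
                return False
            else do
                removeFile tmp2
                renameFile tmp1 file   -- step 5: new version in place
                return True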

That leaves a race if the file gets overwritten after it's moved out
of the way. If the rename refuses to overwrite existing files, that race
would be detected by it failing. renameat(2) with `RENAME_NOREPLACE` can
do that, but probably many special remote interfaces don't provide a way
to do that.

S3 lacks a rename operation; it can only copy and then delete. Which is
not good enough; it risks the file being replaced with new content before
the delete, and the new content being deleted.

Is this race really a significant problem? One way to look at it is as
analogous to a git merge overwriting a locally modified file.
Git can certainly use similar techniques to entirely detect and recover
from such races (but not the similar race described in the next section).
But, git does not actually do that! I modified git's
merge.c to sleep for 10 seconds after `refresh_index()`, and verified
that changes made to the work tree in that window were silently overwritten
by git merge. In git's case, the race window is normally quite narrow
and this is very unlikely to happen (the similar race described in the next
section is more likely).

If git-annex could get the race window similarly small, it would perhaps
be ok. Eg:

1. upload new version of $file to $tmp
2. get content identifier of $file, check if it's what was expected else
   abort
3. rename (or copy and delete) $tmp to $file

The race window between #2 and #3 could be quite narrow for some remotes.
But S3, lacking a rename, does a copy that can be very slow for large
files.

S3, with versioning, could detect the race after the fact, by listing
the versions of the file, and checking if any of the versions is one
that git-annex did not know the file already had.
[Using this api](https://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGETVersion.html),
with version-id-marker set to the previous version of the file,
should list only the previous and current versions; if there's an
intermediate version then the race occurred, and it could roll the change
back, or otherwise recover the overwritten version. This could be done at
import time, to detect a previous race and recover from it: importing
a tree with the file(s) that were overwritten due to the race, leading to a
tree import conflict that the user can resolve. This likely generalizes
to importing a sequence of trees, so each version written to S3 gets
imported.
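
A sketch of that check; listVersionsSince is a hypothetical helper that
would wrap the S3 version listing API with version-id-marker set to the
previous version:

    type VersionId = String

    -- With no race, the listing contains only the previous and the
    -- current version; any other version was written concurrently
    -- and needs to be recovered as an import conflict.
    detectOverwrite
        :: (FilePath -> VersionId -> IO [VersionId])  -- hypothetical
        -> FilePath -> VersionId -> VersionId -> IO Bool
    detectOverwrite listVersionsSince file previous current = do
        vs <- listVersionsSince file previous
        return (any (`notElem` [previous, current]) vs)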

----

A remaining race is that, if the file is open for write at the same
time it's renamed, the write might happen after the content identifier
is checked, and then whatever is written to it will be lost.

But: Git worktree update has the same race condition. Verified with
this perl oneliner, run in a worktree and a second later
followed by a git pull. The lines that it appended to the
file got lost:

    perl -e 'open (OUT, ">>foo") || die "$!"; sleep(10); while (<>) { print OUT $_ }'

Since this is acceptable in git, I suppose we can accept it here too..

----

If multiple repos can access the remote at the same time, then there's a
potential problem when one is exporting a new tree, and the other one is
importing from the remote.

> This can be reduced to the same problem as exports of two
> different trees to the same remote, which is already handled with the
> export log.
>
> Once a tree has been imported from the remote, it's
> in the same state as exporting that same tree to the remote, so
> update the export log to say that the remote has that treeish exported
> to it. A conflict between two export log entries will be handled as
> usual, with the user being prompted to re-export the tree they want
> to be on the remote. (May need to reword that prompt.)
> --[[Joey]]

## api design

Pulling all of the above together, this is an extension to the
ExportActions api.

    listContents :: Annex (Tree [(ExportLocation, ContentIdentifier)])

    getContentIdentifier :: ExportLocation -> Annex (Maybe ContentIdentifier)

    retrieveExportWithContentIdentifier :: ExportLocation -> ContentIdentifier -> (FilePath -> Annex Key) -> MeterUpdate -> Annex (Maybe Key)

    storeExportWithContentIdentifier :: FilePath -> Key -> ExportLocation -> MeterUpdate -> Annex (Maybe ContentIdentifier)

listContents finds the current set of files that are stored in the remote,
some of which may have been written by other programs than git-annex,
along with their content identifiers. It returns a list of those, often in
a single-node tree.

listContents may also find past versions of files that are stored in the
remote, when it supports storing multiple versions of files. Since it
returns a tree of lists of files, it can represent anything from a linear
history to a full branching version control history.
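
For illustration, here is a sketch of what listContents amounts to for a
plain directory remote, flattened to a single node and paired with the
stat-based identifier from the earlier sketch. (A versioned remote could
return a deeper tree of past versions; subdirectories are skipped for
brevity.)

    import System.Directory (listDirectory, doesFileExist)
    import System.FilePath ((</>))
    import Control.Monad (filterM)

    -- One tree node: every file currently present in the remote,
    -- paired with its current content identifier.
    listContentsDir :: FilePath -> IO [(FilePath, StatIdentifier)]
    listContentsDir dir = do
        fs <- filterM (doesFileExist . (dir </>)) =<< listDirectory dir
        mapM (\f -> (,) f <$> mkStatIdentifier (dir </> f)) fs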

retrieveExportWithContentIdentifier is used when downloading a new file
from the remote that listContents found. retrieveExport can't be used
because it has a Key parameter, and the key is not yet known in this case.
(The callback generating a key will let eg S3 record the S3 version id for
the key.)

retrieveExportWithContentIdentifier should detect when the file it's
downloaded may not match the requested content identifier (eg when
something else wrote to it), and fail in that case.
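
A sketch of that check for a directory remote, reusing copyFile and
mkStatIdentifier from the earlier sketches: download, then re-stat the
source, failing if it no longer matches the requested identifier. (A
write landing after the final stat can still go unnoticed; see the race
discussion above.)

    -- Returns False if the file on the remote changed while, or
    -- before, it was being downloaded.
    retrieveChecked :: FilePath -> FilePath -> StatIdentifier -> IO Bool
    retrieveChecked remoteFile dest expected = do
        copyFile remoteFile dest
        cid <- mkStatIdentifier remoteFile
        return (cid == expected)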

storeExportWithContentIdentifier stores a file and gets the content
identifier corresponding to what it stored. It can either get the content
identifier in reply to the store (as S3 does with versioning), or it can
store to a temp location, get the content identifier of that, and then
rename the content into place.

storeExportWithContentIdentifier must avoid overwriting any file that may
have been written to the remote by something else (unless that version of
the file can later be recovered by listContents), so it will typically
need to query for the content identifier before moving the new content
into place.

storeExportWithContentIdentifier needs to handle the case when there's a
race with a concurrent writer. It needs to avoid getting the wrong
ContentIdentifier for data written by the other writer. It may detect such
races and fail, or it could succeed and overwrite the other file, so long
as it can later be recovered by listContents.

----

See also, [[adb_special_remote]]