git-annex/doc/design/importing_trees_from_special_remotes.mdwn
2019-02-22 22:02:50 -04:00

Importing trees from special remotes allows data published by others to be
gathered. It also combines with [[exporting_trees_to_special_remotes]]
to let a special remote act as a kind of git working tree without `.git`,
that the user can alter data in as they like and use git-annex to pull
their changes into the local repository's version control.

(See also [[todo/import_tree]].)

The basic idea is to have a `git annex import --from remote` command.

It would find changed/new/deleted files on the remote, download the
changed/new files, and inject them into the annex. It would then generate a
commit that can be merged (by the command or later by the user) to make
their branch reflect changes made on the remote.
## generating commits and merging
For the merge to work correctly, the parent of the generated commit
needs to be, when possible, a commit whose tree corresponds to the last
tree that was exported to the remote. This way, git merge will treat the
remote the same as a normal git remote where changes were made.

The export log does not record the last exported commit though, only the
tree. And the exported tree may not be the tree of any commit in the
history; it's often a subtree.

So, the export log needs to get a commit sha added to it. And it's possible
that commit will get garbage collected or not pushed, and so not be
available. It could be linked into the git-annex branch as is done for the
exported tree, but doing that for a commit is pretty strange. It's also
possible for the user to export a tree by sha, so there's no commit.
And of course, if no export has been done yet, there would be no commit.

If the last exported commit is not accessible, or not recorded, it seems it
would be OK to make a commit with no parent. git merge would then need
`--allow-unrelated-histories`, and it would be more likely for the merge to
have conflicts.

It's also possible for the export log to indicate an unresolved export
conflict, where two trees got exported to the remote independently. The
content of the remote is not known at this point, but import will resolve
that by getting a list of its contents. So, in this case, use the multiple
commits that are in the export log as the parents of the generated commit,
which nicely indicates to git that there was a conflict and it got
resolved.
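
The three cases above (no usable commit, one last-exported commit, several
conflicting exports) can be sketched in Python. This is only an
illustration: `choose_parents` and the list-of-shas view of the export log
are hypothetical, and skipping commits that are not locally available is an
assumption rather than something decided in this design.

```python
from typing import List, Optional, Set


def choose_parents(exported_commits: List[Optional[str]],
                   available: Set[str]) -> List[str]:
    """Pick the parents of the commit generated by an import.

    exported_commits: one entry per tree in the export log (hypothetical
    representation); None when no commit sha was recorded for that tree.
    available: commit shas that exist in the local repository.
    """
    parents: List[str] = []
    for sha in exported_commits:
        # Assumption: a commit that was never recorded, or that has been
        # garbage collected or not pushed, cannot be used as a parent.
        if sha is not None and sha in available and sha not in parents:
            parents.append(sha)
    return parents
```

An empty result corresponds to a parentless commit (merged with
`--allow-unrelated-histories`), one entry to the normal case, and several
entries to an export conflict that the import resolves.
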
## command line interface
`git annex import --from remote` would import files from the remote to the
top of the working tree. Sometimes users will want to import into a
subdirectory, so there should be a way to do that.

`git annex export` has its own way to specify a subdirectory to export,
eg "master:subdir" (which is one way of referring to a git tree in git).
So it seems it would make sense to make importing use a similar syntax.

When importing, "master:subdir" would mean to import into a tree at subdir,
and merge it into master. So any branch ref not containing a colon, eg
"master", naturally means import not in a subdir, and merge it into the
branch.

Note that while export can have a particular commit or tree sha specified,
it does not make sense to import *to* a particular sha.

Also, there should be a way to configure it so `git annex sync --content`
first imports from a remote and then exports to it. Currently `git annex
export` has `--tracking` to configure the latter. It seems to only make
sense to import and export the same tracking branch. So, should `git annex
export --tracking` set the same thing, or perhaps it would be better to
move the tracking branch configuration out of `git annex export` and into
an interface that explicitly configures both import and export?
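
As an illustration of the "master:subdir" syntax discussed in this section,
here is a small Python sketch that splits such an argument into a branch and
a subdirectory. `ImportTarget` and `parse_import_target` are hypothetical
names, and trimming slashes from the subdirectory is an assumption.

```python
from typing import NamedTuple


class ImportTarget(NamedTuple):
    branch: str   # branch the imported tree gets merged into
    subdir: str   # "" means the top of the tree


def parse_import_target(spec: str) -> ImportTarget:
    """Split "branch" or "branch:subdir" into its two parts."""
    branch, _sep, subdir = spec.partition(":")
    if not branch:
        raise ValueError("no branch given in %r" % spec)
    # Assumption: leading and trailing slashes in the subdir are not
    # significant.
    return ImportTarget(branch, subdir.strip("/"))
```

With this sketch, `parse_import_target("master")` has an empty `subdir`,
while `parse_import_target("master:subdir")` names both parts.
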
## content identifiers
The remote is responsible for collecting a list of
files currently in it, along with some content identifier. That data is
sent to git-annex. git-annex keeps track of which content identifier(s) map
to which keys, and uses the information to determine when a file on the
remote has changed or is new.
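
A rough Python sketch of that bookkeeping follows. It splits a listing from
the remote into files whose content identifier already maps to a key and
files that are new or changed and still need to be downloaded. The names
`plan_import`, `listing` and `cid_to_key` are hypothetical, and identifiers
and keys are plain strings here.

```python
from typing import Dict, List, Tuple


def plan_import(listing: List[Tuple[str, str]],
                cid_to_key: Dict[str, str]
                ) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Split a remote listing of (location, content identifier) pairs.

    Returns the locations whose key is already known, and the
    (location, content identifier) pairs that must be downloaded.
    """
    known: Dict[str, str] = {}
    to_download: List[Tuple[str, str]] = []
    for location, cid in listing:
        key = cid_to_key.get(cid)
        if key is None:
            # Unknown identifier: the file is new or was modified.
            to_download.append((location, cid))
        else:
            known[location] = key
    return known, to_download
```
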
git-annex can simply build git tree objects as the file list
comes in, looking up the key corresponding to each content identifier
(or downloading the content from the remote and adding it to the annex
when there's no corresponding key yet). It might be possible to avoid
git-annex buffering much tree data in memory.

----

A good content identifier needs to:

* Be stable, so when a file has not changed, the content identifier
  remains the same.
* Change when a file is modified.
* Be as unique as possible, but not necessarily fully unique.

A hash of the content would be ideal.

A (size, mtime, inode) tuple is as good a content identifier as git uses in
its index.
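
For a remote that is a plain directory, such a tuple could be gathered with
`stat`. The Python sketch below is only an illustration;
`stat_content_identifier` is a hypothetical name and the space-separated
string encoding is an assumption.

```python
import os


def stat_content_identifier(path: str) -> str:
    """Build a (size, mtime, inode) content identifier for a local file."""
    st = os.stat(path)
    # Nanosecond mtime is used so that quick successive writes are more
    # likely to change the identifier.
    return "%d %d %d" % (st.st_size, st.st_mtime_ns, st.st_ino)
```

The identifier stays the same as long as the file is untouched, and changes
when the file's size or mtime changes, or when the file is replaced by one
with a different inode.
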
git-annex will need a way to get the content identifiers of files
that it stores on the remote when exporting a tree to it, so it can later
know if those files have changed.

----

The content identifier needs to be stored somehow for later use.

It would be good to store the content identifiers only locally, if
possible.

Would local storage pose a problem when multiple repositories import from
the same remote? In that case, perhaps different trees would be imported,
and merged into master. So the two repositories then have differing
masters, which can be reconciled in merge as usual.

Since exporttree remotes don't have content identifier information yet, it
needs to be collected the first time import tree is used. (Or import
everything, but that is probably too expensive). Any modifications made to
exported files before the first import tree would not be noticed. Seems
acceptable as long as this only affects exporttree remotes created before
this feature was added.

What if repo A is being used to import tree from R for a while, and the
user gets used to editing files on R and importing them. Then they stop
using A and switch to clone B. It would not have the content identifier
information that A did. It seems that in this case, B needs to re-download
everything, to build up the map of content identifiers.
(Anything could have changed since the last time A imported).

That seems too expensive!

Would storing content identifiers in the git-annex branch be too
expensive? Probably not. For S3 with versioning a content identifier is
already stored. When the content identifier is (mtime, size, inode),
that's a small amount of data. The maximum size of a content identifier
could be limited to the size of a typical hash, and if a remote for some
reason gets something larger, it could simply hash it to generate
the content identifier.
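
The size limit suggested above could look like the following Python sketch.
The 64 byte limit (the length of a hex-encoded SHA-256 digest), the choice
of SHA-256, and the name `bounded_identifier` are all assumptions made for
illustration.

```python
import hashlib

# Assumed limit: the length of a hex-encoded SHA-256 digest.
MAX_IDENTIFIER_LEN = 64


def bounded_identifier(raw: bytes) -> bytes:
    """Return raw unchanged if it is small enough, else a hash of it."""
    if len(raw) <= MAX_IDENTIFIER_LEN:
        return raw
    # Hashing is deterministic, so the same oversized identifier always
    # maps to the same bounded one.
    return hashlib.sha256(raw).hexdigest().encode("ascii")
```
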
## safety
Since the special remote can be written to at any time by something other
than git-annex, git-annex needs to take care when exporting to it, to avoid
overwriting such changes.

This is similar to how git merge avoids overwriting modified files in the
working tree.

Surprisingly, git merge doesn't avoid overwrites in all conditions! I
modified git's merge.c to sleep for 10 seconds after `refresh_index()`, and
verified that changes made to the work tree in that window were silently
overwritten by git merge. In git's case, the race window is normally quite
narrow and this is very unlikely to happen.

Also, git merge can overwrite a file that a process has open for write;
the process's changes then get lost. Verified with
this perl oneliner, run in a worktree and a second later
followed by a git pull. The lines that it appended to the
file got lost:

    perl -e 'open (OUT, ">>foo") || die "$!"; sleep(10); while (<>) { print OUT $_ }'

git-annex should take care to be at least as safe as git merge when
exporting to a special remote that supports imports.

The situations to keep in mind are these:

1. File is changed on the remote after an import tree, and an export wants
   to also change it. Need to avoid the export overwriting the
   file. Or, need a way to detect such an overwrite and recover the version
   of the file that got overwritten, after the fact.

2. File is changed on the remote while it's being imported, and part of one
   version + part of the other version is downloaded. Need to detect this
   and fail the import.

3. File is changed on the remote after its content identifier is checked
   and before it's downloaded, so the wrong version gets downloaded.
   Need to detect this and fail the import.
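
Situations 2 and 3 can be illustrated with a Python sketch that treats a
local directory as the remote and uses a (size, mtime, inode) tuple as the
content identifier. `retrieve_checked`, `identifier` and `ContentChanged`
are hypothetical names. The check is only as strong as the identifier: a
rewrite that leaves size, mtime and inode unchanged would go unnoticed.

```python
import os
import shutil
from typing import Tuple

Identifier = Tuple[int, int, int]


class ContentChanged(Exception):
    """The file no longer matches the expected content identifier."""


def identifier(path: str) -> Identifier:
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns, st.st_ino)


def retrieve_checked(src: str, dest: str, expected: Identifier) -> None:
    """Copy src to dest, failing if src changed before or during the copy."""
    if identifier(src) != expected:
        # Situation 3: modified after it was listed.
        raise ContentChanged("%s changed before download" % src)
    shutil.copyfile(src, dest)
    if identifier(src) != expected:
        # Situation 2: modified while it was being downloaded, so dest may
        # mix two versions and is discarded.
        os.unlink(dest)
        raise ContentChanged("%s changed during download" % src)
```
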
## api design
This is an extension to the ExportActions api.

    listContents :: Annex (Tree [(ExportLocation, ContentIdentifier)])

    retrieveExportWithContentIdentifier :: ExportLocation -> ContentIdentifier -> (FilePath -> Annex Key) -> MeterUpdate -> Annex (Maybe Key)

    storeExportWithContentIdentifier :: FilePath -> Key -> ExportLocation -> Maybe ContentIdentifier -> MeterUpdate -> Annex (Maybe ContentIdentifier)

listContents finds the current set of files that are stored in the remote,
some of which may have been written by other programs than git-annex,
along with their content identifiers. It returns a list of those, often in
a single node tree.

listContents may also find past versions of files that are stored in the
remote, when it supports storing multiple versions of files. Since it
returns a tree of lists of files, it can represent anything from a linear
history to a full branching version control history.

retrieveExportWithContentIdentifier is used when downloading a new file from
the remote that listContents found. retrieveExport can't be used because
it has a Key parameter and the key is not yet known in this case.
(The callback generating a key will let eg S3 record the S3 version id for
the key.)

retrieveExportWithContentIdentifier should detect when the file it has
downloaded may not match the requested content identifier (eg when
something else wrote to it while it was being retrieved), and fail
in that case.

storeExportWithContentIdentifier stores content and returns the
content identifier corresponding to what it stored. It can either get
the content identifier in reply to the store (as S3 does with versioning),
or it can store to a temp location, get the content identifier of that,
and then rename the content into place.

storeExportWithContentIdentifier must avoid overwriting any existing file
on the remote, unless the file has the same content identifier that's passed
to it, to avoid overwriting a file that was modified by something else.
Alternatively, if listContents can later recover the modified file, it may
overwrite the modified file.

storeExportWithContentIdentifier needs to handle the case when there's a
race with a concurrent writer. It needs to avoid getting the wrong
ContentIdentifier for data written by the other writer. It may detect such
races and fail, or it could succeed and overwrite the other file, so long
as it can later be recovered by listContents.
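
The "store to a temp location, then rename" approach can be sketched in
Python for a remote that is a local directory. This is not the git-annex
implementation. `store_checked`, `identifier` and the `.tmp` suffix are
hypothetical, and a writer that modifies the destination between the check
and the rename is still not detected by this sketch.

```python
import os
import shutil
from typing import Optional, Tuple

Identifier = Tuple[int, int, int]


def identifier(path: str) -> Identifier:
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns, st.st_ino)


def store_checked(src: str, dest: str,
                  expected: Optional[Identifier]) -> Optional[Identifier]:
    """Store src at dest and return the new content identifier.

    Returns None, leaving dest untouched, when dest exists and its
    identifier differs from expected.
    """
    tmp = dest + ".tmp"  # hypothetical temp location on the remote
    shutil.copyfile(src, tmp)
    # A rename keeps size, mtime and inode, so the identifier of the temp
    # file is the identifier of what ends up at dest.
    new_id = identifier(tmp)
    try:
        current = identifier(dest)
    except FileNotFoundError:
        current = None
    if current is not None and current != expected:
        os.unlink(tmp)  # someone else modified dest; do not overwrite
        return None
    os.replace(tmp, dest)
    return new_id
```
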
## multiple git-annex repos accessing a special remote
If multiple repos can access the remote at the same time, then there's a
potential problem when one is exporting a new tree, and the other one is
importing from the remote.

This can be reduced to the same problem as exports of two
different trees to the same remote, which is already handled with the
export log.

Once a tree has been imported from the remote, it's
in the same state as exporting that same tree to the remote, so
update the export log to say that the remote has that treeish exported
to it. A conflict between two export log entries will be handled as
usual, with the user being prompted to re-export the tree they want
to be on the remote. (May need to reword that prompt.)