Importing trees from special remotes allows data published by others to be
gathered. It also combines with [[exporting_trees_to_special_remotes]]
to let a special remote act as a kind of git working tree without `.git`,
that the user can alter data in as they like and use git-annex to pull
their changes into the local repository's version control.

(See also [[todo/import_tree]].)

The basic idea is to have a `git annex import --from remote` command.

It would find changed/new/deleted files on the remote.
Download the changed/new files and inject into the annex.
And then generate a commit that can be merged (by the command or later by
the user) to make their branch reflect changes made on the remote.

## generating commits and tracking branches

For the merge to work correctly, the parent of the generated commit
needs to be, when possible, a commit whose tree corresponds to the last
tree that was exported to the remote. This way, git merge will treat the
remote the same as a normal git remote where changes were made.

If the last exported commit is not known, it would need to make a commit
with no parent. git merge would then need --allow-unrelated-histories,
and it would be more likely for the merge to have conflicts.

The export log does not record the last exported commit though, only the
tree. And the exported tree may not be the tree of any commit in the
history; it's often a subtree.

Should the last exported commit be stored in the git-annex branch?
Could be done, but maybe it's not needed. What the user probably expects
is that, since importing is like pulling from a remote, and exporting is
like pushing, there will be a remote tracking branch that is updated. Eg,
"refs/remotes/S3/master". The special remote is not a git repo with
branches, so doesn't really have a master branch of its own, but this naming
means that the user can "git merge S3/master" to merge in the imported tree.

If the user starts off in one repository, and later changes to using a
different repository to import from the same special remote, the tracking
branch would not be present there. So import would need to make a new branch
with no parent, and they would have to use --allow-unrelated-histories.
Perhaps the user could first export to the special remote, to get the
branch set up, and then import. Assuming that exporting in this situation
won't overwrite modified files on the special remote (see API below) and
will succeed enough to update the tracking branch.

Seems best to start with a remote tracking branch, since the user is going
to expect there to be one, and if it later turns out that the last exported
commit needs to be available across clones, store it in the git-annex
branch then.

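The choice of parent described in this section can be sketched roughly as
follows. This is only an illustration: `Sha`, `trackingRef`, `importParents`
and `needsAllowUnrelatedHistories` are made-up names, not git-annex's own
code, and it assumes the tracking branch (when present) points at a suitable
parent commit.

```haskell
import Data.Maybe (maybeToList)

-- Hypothetical stand-in for a git commit sha.
newtype Sha = Sha String deriving (Eq, Show)

-- Name of the remote tracking branch that an import updates,
-- eg "refs/remotes/S3/master".
trackingRef :: String -> String -> String
trackingRef remote branch = "refs/remotes/" ++ remote ++ "/" ++ branch

-- Parents of the commit generated by an import. The argument is the
-- commit the tracking branch currently points to, if there is one.
importParents :: Maybe Sha -> [Sha]
importParents = maybeToList

-- With no parent, the generated commit is unrelated to the user's
-- branch, so merging it needs --allow-unrelated-histories.
needsAllowUnrelatedHistories :: Maybe Sha -> Bool
needsAllowUnrelatedHistories = null . importParents
```

In a fresh clone there is no tracking branch yet, so the argument would be
`Nothing` and the import commit would have no parent.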
## export conflict resolution

What if the export log indicates an unresolved export conflict,
and the user tries to import from the special remote?

Well, two trees got exported to the remote independently. The content of
the remote is not known to the export code when there's a conflict, but import
will resolve that by getting a list of its contents. Although that may be
an admixture of the two exported trees, and so not necessarily a change the
user will want to merge into master.

One approach is to not allow imports in this situation; require the export
conflict be resolved first. (--force could override if the user just wants
to import whatever ended up on the special remote.)

Another approach, if the commits that contain the trees that were exported
are known, is to do the import and make a commit that uses those commits
as its parents. Which nicely indicates to git that there was a conflict and
it got "resolved".

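The two approaches can be compared in a small sketch. Everything here is
hypothetical (`ExportState`, `Policy`, `importCommitParents`), and the choice
to give a forced import no parents is an assumption; the text above does not
say what its parents should be.

```haskell
import Data.Maybe (maybeToList)

-- Hypothetical stand-in for a git commit sha.
newtype Sha = Sha String deriving (Eq, Show)

-- What the export log says about the remote (a simplified, assumed model).
data ExportState
    = NoConflict (Maybe Sha)  -- last exported commit, when it is known
    | Conflict [Sha]          -- commits whose trees were exported independently
    deriving (Eq, Show)

data Policy
    = RequireResolution Bool   -- first approach; the Bool models --force
    | MergeConflictingCommits  -- second approach
    deriving (Eq, Show)

-- Parents to use for the import commit, or a reason to refuse the import.
importCommitParents :: Policy -> ExportState -> Either String [Sha]
importCommitParents _ (NoConflict parent) = Right (maybeToList parent)
importCommitParents (RequireResolution forced) (Conflict _)
    | forced = Right []  -- assumption: a forced import gets no parents
    | otherwise = Left "unresolved export conflict; resolve it first or use --force"
importCommitParents MergeConflictingCommits (Conflict commits) = Right commits
```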
## command line interface

`git annex import master --from foo` will import a tree from the remote
and update the "refs/remotes/foo/master" tracking branch to that tree.

Users will want a way to import files from a remote into a subdirectory,
and by analogy to how `git annex export` handles that, it should be
"master:subdir". So, `git annex import master:subdir --from foo`
will import a tree from the remote and graft it into the current master
branch at subdir (replacing whatever's there), storing the result in
the "refs/remotes/foo/master" tracking branch.

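A minimal sketch of how the "master:subdir" argument might be taken apart.
`ImportTarget`, `parseImportTarget` and `trackingRef` are made-up names, and
treating an empty subdirectory like no subdirectory is an assumption.

```haskell
-- The branch, and optional subdirectory, named on the command line.
data ImportTarget = ImportTarget
    { targetBranch :: String
    , targetSubdir :: Maybe FilePath
    } deriving (Eq, Show)

-- Split "branch:subdir" at the first colon. An empty subdirectory
-- ("master:") is treated the same as no subdirectory (an assumption).
parseImportTarget :: String -> ImportTarget
parseImportTarget s = case break (== ':') s of
    (branch, ':' : subdir@(_ : _)) -> ImportTarget branch (Just subdir)
    (branch, _) -> ImportTarget branch Nothing

-- Tracking branch updated by importing the target from a remote.
-- The subdirectory does not affect it.
trackingRef :: String -> ImportTarget -> String
trackingRef remote t = "refs/remotes/" ++ remote ++ "/" ++ targetBranch t
```

With this, both `master` and `master:subdir` imported from foo would update
"refs/remotes/foo/master", matching the two examples above.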
Note that while export can have a particular commit or tree sha specified,
it does not make sense to import *to* a particular sha.

Should `git annex import` merge the tracking branch by itself, or leave it
up to the user? Seems most ergonomic to merge by default; if the user
does not want to merge, it could be `git annex import --fetch --from remote`
or a separate command.

Also, there should be a way to configure the default tracking branch, so
`git annex sync --content` first imports from a remote, merges that, and
then exports to it. Currently `git annex export` has `--tracking` to
configure the latter. It seems to only make sense to import and export the
same tracking branch. So, should `git annex export --tracking` set the same
thing, or perhaps it would be better to move the tracking branch
configuration out of `git annex export` and into an interface that
explicitly configures both import and export?

## content identifiers

The remote is responsible for collecting a list of
files currently in it, along with some content identifier. That data is
sent to git-annex. git-annex keeps track of which content identifier(s) map
to which keys, and uses the information to determine when a file on the
remote has changed or is new.

git-annex can simply build git tree objects as the file list
comes in, looking up the key corresponding to each content identifier
(or downloading the content from the remote and adding it to the annex
when there's no corresponding key yet). It might be possible to avoid
git-annex buffering much tree data in memory.

----

A good content identifier needs to:

* Be stable, so when a file has not changed, the content identifier
  remains the same.
* Change when a file is modified.
* Be as unique as possible, but not necessarily fully unique.
  A hash of the content would be ideal.
  A (size, mtime, inode) tuple is as good a content identifier as git uses in
  its index.

git-annex will need a way to get the content identifiers of files
that it stores on the remote when exporting a tree to it, so it can later
know if those files have changed.

----

The content identifier needs to be stored somehow for later use.

It would be good to store the content identifiers only locally, if
possible.

Would local storage pose a problem when multiple repositories import from
the same remote? In that case, perhaps different trees would be imported,
and merged into master. So the two repositories then have differing
masters, which can be reconciled in merge as usual.

Since exporttree remotes don't have content identifier information yet, it
needs to be collected the first time import tree is used. (Or import
everything, but that is probably too expensive). Any modifications made to
exported files before the first import tree would not be noticed. Seems
acceptable as long as this only affects exporttree remotes created before
this feature was added.

What if repo A is being used to import tree from R for a while, and the
user gets used to editing files on R and importing them. Then they stop
using A and switch to clone B. It would not have the content identifier
information that A did. It seems that in this case, B needs to re-download
everything, to build up the map of content identifiers.
(Anything could have changed since the last time A imported).
That seems too expensive!

Would storing content identifiers in the git-annex branch be too
expensive? Probably not. For S3 with versioning a content identifier is
already stored. When the content identifier is (size, mtime, inode),
that's a small amount of data. The maximum size of a content identifier
could be limited to the size of a typical hash, and if a remote for some
reason gets something larger, it could simply hash it to generate
the content identifier.

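The bookkeeping described at the start of this section might look something
like the sketch below. It assumes a (size, mtime, inode) content identifier
and a simple in-memory map; `Key`, `CidMap`, `Action`, `planImport` and
`recordKey` are hypothetical names, and it says nothing about where the map
ends up being stored.

```haskell
import qualified Data.Map.Strict as M

-- Hypothetical stand-in; git-annex's real Key type differs.
newtype Key = Key String deriving (Eq, Show)

-- A (size, mtime, inode) content identifier.
data ContentIdentifier = ContentIdentifier
    { cidSize :: Integer
    , cidMtime :: Integer
    , cidInode :: Integer
    } deriving (Eq, Ord, Show)

-- Content identifiers that are already known, and the keys they map to.
type CidMap = M.Map ContentIdentifier Key

-- What an import needs to do with one file listed by the remote.
data Action
    = UseKey Key  -- content already known; just reference the key
    | Download    -- new or changed file; download it to learn its key
    deriving (Eq, Show)

-- Decide what to do with each listed file. A modified file shows up with
-- a content identifier that is not in the map, so it is handled the same
-- way as a new file.
planImport :: CidMap -> [(FilePath, ContentIdentifier)] -> [(FilePath, Action)]
planImport known listing =
    [ (f, maybe Download UseKey (M.lookup cid known)) | (f, cid) <- listing ]

-- Record the key that a downloaded file turned out to have.
recordKey :: ContentIdentifier -> Key -> CidMap -> CidMap
recordKey = M.insert
```

A clone that starts with an empty map gets `Download` for every file, which
is the expensive case discussed above.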
## safety

Since the special remote can be written to at any time by something other
than git-annex, git-annex needs to take care when exporting to it, to avoid
overwriting such changes.

This is similar to how git merge avoids overwriting modified files in the
working tree.

Surprisingly, git merge doesn't avoid overwrites in all conditions! I
modified git's merge.c to sleep for 10 seconds after `refresh_index()`, and
verified that changes made to the work tree in that window were silently
overwritten by git merge. In git's case, the race window is normally quite
narrow and this is very unlikely to happen.

Also, git merge can overwrite a file that a process has open for write;
the process's changes then get lost. Verified with
this perl oneliner, run in a worktree and a second later
followed by a git pull. The lines that it appended to the
file got lost:

    perl -e 'open (OUT, ">>foo") || die "$!"; sleep(10); while (<>) { print OUT $_ }'

git-annex should take care to be at least as safe as git merge when
exporting to a special remote that supports imports.

The situations to keep in mind are these:

1. File is changed on the remote after an import tree, and an export wants
   to also change it. Need to avoid the export overwriting the
   file. Or, need a way to detect such an overwrite and recover the version
   of the file that got overwritten, after the fact.

2. File is changed on the remote while it's being imported, and part of one
   version + part of the other version is downloaded. Need to detect this
   and fail the import.

3. File is changed on the remote after its content identifier is checked
   and before it's downloaded, so the wrong version gets downloaded.
   Need to detect this and fail the import.

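Situations 2 and 3 could be caught by checking the content identifier on both
sides of the download, roughly as in this sketch. `retrieveChecked` and its
two action arguments are hypothetical, and it only helps when the content
identifier does change whenever the file is modified.

```haskell
-- Download a file only if its content identifier is the expected one
-- both before and after the download. Generic over the monad and the
-- types, so it does not depend on any real remote.
retrieveChecked
    :: (Monad m, Eq cid)
    => m cid      -- look up the file's current content identifier
    -> m content  -- download the file
    -> cid        -- identifier the import expects
    -> m (Maybe content)
retrieveChecked getCid download expected = do
    before <- getCid
    if before /= expected
        then return Nothing  -- situation 3: changed before the download
        else do
            content <- download
            after <- getCid
            return $ if after == expected
                then Just content
                else Nothing  -- situation 2: changed during the download
```

A remote that can ask for a specific version (as S3 with versioning can)
would not need the second check.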
## api design

This is an extension to the ExportActions api.

    listContents :: Annex (ContentHistory [(ExportLocation, ContentIdentifier)])

    retrieveExportWithContentIdentifier :: ExportLocation -> ContentIdentifier -> (FilePath -> Annex Key) -> MeterUpdate -> Annex (Maybe Key)

    storeExportWithContentIdentifier :: FilePath -> Key -> ExportLocation -> [ContentIdentifier] -> MeterUpdate -> Annex (Maybe ContentIdentifier)

    removeExportWithContentIdentifier :: Key -> ExportLocation -> [ContentIdentifier] -> Annex Bool

    removeExportDirectoryWhenEmpty :: Maybe (ExportDirectory -> Annex Bool)

    checkPresentExportWithContentIdentifier :: Key -> ExportLocation -> [ContentIdentifier] -> Annex (Maybe Bool)

listContents finds the current set of files that are stored in the remote,
some of which may have been written by other programs than git-annex,
along with their content identifiers. It returns a list of those, often in
a single node tree.

listContents may also find past versions of files that are stored in the
remote, when it supports storing multiple versions of files. Since it
returns a history tree of lists of files, it can represent anything from a
linear history to a full branching version control history.

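One possible shape for the history tree, as a sketch only. The design above
does not define ContentHistory, so this definition and the helpers
`currentOnly`, `linear` and `historySize` are assumptions.

```haskell
-- A hypothetical shape for ContentHistory: the state of the remote at
-- one point in time, plus the histories that led up to it.
data ContentHistory t = ContentHistory t [ContentHistory t]
    deriving (Eq, Show)

-- A remote that keeps no old versions yields a single node tree.
currentOnly :: t -> ContentHistory t
currentOnly t = ContentHistory t []

-- A linear history, newest state first. The newest state is passed
-- separately so that the history can never be empty.
linear :: t -> [t] -> ContentHistory t
linear newest [] = currentOnly newest
linear newest (older : rest) = ContentHistory newest [linear older rest]

-- Number of states recorded in a history tree.
historySize :: ContentHistory t -> Int
historySize (ContentHistory _ parents) = 1 + sum (map historySize parents)
```

A node with more than one parent history would stand for a merge, which is
what lets the same type cover a branching version control history.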
retrieveExportWithContentIdentifier is used when downloading a new file from
the remote that listContents found. retrieveExport can't be used because
it has a Key parameter and the key is not yet known in this case.
(The callback generating a key will let eg S3 record the S3 version id for
the key.)

retrieveExportWithContentIdentifier should detect when the file it's
downloaded may not match the requested content identifier (eg when
something else wrote to it while it was being retrieved), and fail
in that case.

When a remote supports imports and exports, storeExport and removeExport
should not be used when exporting to it;
storeExportWithContentIdentifier and removeExportWithContentIdentifier
should be used instead.

storeExportWithContentIdentifier stores content and returns the
content identifier corresponding to what it stored. It can either get
the content identifier in reply to the store (as S3 does with versioning),
or it can store to a temp location, get the content identifier of that,
and then rename the content into place.

storeExportWithContentIdentifier must avoid overwriting any existing file
on the remote, unless the file has one of the content identifiers passed
to it, to avoid overwriting a file that was modified by something else.
But alternatively, if listContents can later recover the modified file, it can
overwrite the modified file.

Similarly, removeExportWithContentIdentifier must only remove a file
on the remote if it has one of the content identifiers passed to it,
or if listContents can later recover the modified file.
Otherwise it should fail. (Like removeExport, removeExportWithContentIdentifier
also succeeds if the file is not present.)

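The rule shared by storing and removing can be written down as a small pure
function. This is a sketch with made-up names (`Decision`,
`guardModification`); a real remote would also have to deal with races
between the check and the write.

```haskell
-- Decision a remote makes before it overwrites or removes a file.
data Decision = Proceed | Refuse
    deriving (Eq, Show)

-- The first argument says whether listContents can later recover a
-- modified file (as with a versioned remote). The second is the content
-- identifier of the file currently on the remote, if it exists. The
-- third is the list of content identifiers passed in by git-annex.
guardModification :: Eq cid => Bool -> Maybe cid -> [cid] -> Decision
guardModification _ Nothing _ = Proceed  -- nothing there to lose
guardModification recoverable (Just current) known
    | recoverable = Proceed
    | current `elem` known = Proceed
    | otherwise = Refuse  -- modified by something else
```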
Both storeExportWithContentIdentifier and removeExportWithContentIdentifier
need to handle the case when there's a race with a concurrent writer.
They can detect such races and fail. Or, if overwritten/deleted modified
files can later be recovered by listContents, it's acceptable to not detect
the race.

removeExportDirectoryWhenEmpty is used instead of removeExportDirectory.
It should only remove empty directories, and should succeed, leaving the
directory in place, if there are files in it.

checkPresentExportWithContentIdentifier is used instead of
checkPresentExport. It should verify that one of the provided
ContentIdentifiers matches the current content of the file.

Note that renameExport is never used when the special remote supports
imports, because it may have an implementation that loses changes
to imported files. (For example, it may copy the file to the new name,
and delete the old name.)

## multiple git-annex repos accessing a special remote

If multiple repos can access the remote at the same time, then there's a
potential problem when one is exporting a new tree, and the other one is
importing from the remote.

This can be reduced to the same problem as exports of two
different trees to the same remote, which is already handled with the
export log.

Once a tree has been imported from the remote, the remote is
in the same state as if that same tree had been exported to it, so
update the export log to say that the remote has that treeish exported
to it. A conflict between two export log entries will be handled as
usual, with the user being prompted to re-export the tree they want
to be on the remote. (May need to reword that prompt.)
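
How an import could feed the export log, in a much simplified sketch. The
real export log records more than this; `Tree`, `ExportLog`,
`recordImported`, `exportedTrees` and `hasConflict` are all hypothetical.

```haskell
import Data.List (nub)

-- Hypothetical stand-in for a git tree sha.
newtype Tree = Tree String deriving (Eq, Show)

-- A much simplified export log: for each repository (named by a uuid
-- string), the tree it last recorded as being on the remote.
type ExportLog = [(String, Tree)]

-- An import records its tree the same way an export of that tree would.
recordImported :: String -> Tree -> ExportLog -> ExportLog
recordImported repo tree l = (repo, tree) : filter ((/= repo) . fst) l

-- Distinct trees that the log claims are on the remote.
exportedTrees :: ExportLog -> [Tree]
exportedTrees = nub . map snd

-- More than one distinct tree means the entries conflict.
hasConflict :: ExportLog -> Bool
hasConflict l = length (exportedTrees l) > 1
```

If repo A imports tree t1 while repo B exports t2, the log ends up with two
distinct trees, which is the same conflict that two competing exports produce.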