dec30d2b14
Note that I tried an evil remote that lists ImportLocations with ../../../ in them and indeed this resulted in git blowing up and the import failing, and not writing outside the repo.
308 lines
14 KiB
Markdown
308 lines
14 KiB
Markdown
Importing trees from special remotes allows data published by others to be
|
|
gathered. It also combines with [[exporting_trees_to_special_remotes]]
|
|
to let a special remote act as a kind of git working tree without `.git`,
|
|
that the user can alter data in as they like and use git-annex to pull
|
|
their changes into the local repository's version control.
|
|
|
|
(See also [[todo/import_tree]].)
|
|
|
|
The basic idea is to have a `git annex import --from remote` command.
|
|
|
|
It would find changed/new/deleted files on the remote.
|
|
Download the changed/new files and inject into the annex.
|
|
And then generate a commit that can be merged (by the command or later by
|
|
the user) to make their branch reflect changes made on the remote.
|
|
|
|
## generating commits and tracking branches
|
|
|
|
For the merge to work correctly, the parent of the generated commit
|
|
needs to be, when possible, a commit whose tree corresponds to the last
|
|
tree that was exported to the remote. This way, git merge will treat the
|
|
remote the same as a normal git remote where changes were made.
|
|
|
|
If the last exported commit is not known, it would need to make a commit
|
|
with no parent. git merge would then need --allow-unrelated-histories,
|
|
and it would be more likely for the merge to have conflicts.
|
|
|
|
The export log does not record the last exported commit though, only the
|
|
tree. And the exported tree may not be the tree of any commit in the
|
|
history; it's often a subtree.
|
|
|
|
Should the last exported commit be stored in the git-annex branch?
|
|
Could be done, but maybe it's not needed.. What the user probably expects
|
|
is that, since importing is like pulling from a remote, and exporting is
|
|
like pushing, for there to be a remote tracking branch that is updated. Eg,
|
|
"refs/remotes/S3/master". The special remote is not a git repo with
|
|
branch, so doesn't really have a master branch of its own, but this naming
|
|
means that the user can "git merge S3" to merge in the imported tree.
|
|
|
|
If the user starts off in one repository, and later changes to using a
|
|
different repository to import from the same special remote, the tracking
|
|
branch would not be present there. So import would need to make a new branch
|
|
with no parent, and they would have to use --allow-unrelated-histories.
|
|
Perhaps the user could first export to the special remote, to get the
|
|
branch set up, and then import. Assuming that exporting in this situation
|
|
won't overwrite modified file on the special remote (see API below) and
|
|
will succeed enough to update the tracking branch.
|
|
|
|
Seems best to start with a remote tracking branch, since the user is going
|
|
to expect there to be one, and if it later turns out that the last exported
|
|
commit needs to be available across clones, store it in the git-annex
|
|
branch then.
|
|
|
|
## export conflict resolution
|
|
|
|
What if the export log indicates an unresolved export conflict,
|
|
and the user tries to import from the special remote?
|
|
|
|
Well, two trees got exported to the remote independently. The content of
|
|
the remote is not known to export code when there's a conflict, but import
|
|
will resolve that by getting a list of its contents. Although that may be
|
|
an admixture of the two exported trees, and so not necessarily a change the
|
|
user will want to merge into master.
|
|
|
|
One approach is to not allow imports in this situation; require the export
|
|
conflict be resolved first. (--force could override if the user just wants
|
|
to import whatever ended up on the special remote.)
|
|
|
|
Another approach, if the commits that contain the trees that were exported
|
|
is known, is to do the import and make a commit that uses those commits
|
|
as its parents. Which nicely indicates to git that there was a conflict and
|
|
it got "resolved".
|
|
|
|
## command line interface
|
|
|
|
`git annex import master --from foo` will import a tree from the remote
|
|
and update the "refs/remotes/foo/master" tracking branch to that tree.
|
|
|
|
Users will want a way to import files from a remote into a subdirectory,
|
|
and by analogy to how `git annex export` handles that, it should be
|
|
"master:subdir". So, `git annex import master:subdir --from foo`
|
|
will import a tree from the remote and graft it into the current master
|
|
branch at subdir (replacing whatever's there), storing the result in
|
|
the "refs/remotes/foo/master" tracking branch.
|
|
|
|
Note that while export can have a particular commit or tree sha specified,
|
|
it does not makes sense to import *to* a particular sha.
|
|
|
|
Should `git annex import` merge the tracking branch by itself, or leave it
|
|
up to the user? Seems most ergonomic to merge by default; if the user
|
|
wants to not merge it could be `git annex import --fetch --from remote`
|
|
or a separate command.
|
|
|
|
Also, there should be a way to configure the default tracking branch, so
|
|
`git annex sync --content` first imports from a remote, merges that, and
|
|
then exports to it. Currently `git annex export` has `--tracking` to
|
|
configure the latter. It seems to only make sense to import and export the
|
|
same tracking branch. So, should `git annex export --tracking` set the same
|
|
thing, or perhaps it would be better to move the tracking branch
|
|
configuration out of `git annex export` and into an interface that
|
|
explicitly configures both import and export?
|
|
|
|
## content identifiers
|
|
|
|
The remote is responsible for collecting a list of
|
|
files currently in it, along with some content identifier. That data is
|
|
sent to git-annex. git-annex keeps track of which content identifier(s) map
|
|
to which keys, and uses the information to determine when a file on the
|
|
remote has changed or is new.
|
|
|
|
git-annex can simply build git tree objects as the file list
|
|
comes in, looking up the key corresponding to each content identifier
|
|
(or downloading the content from the remote and adding it to the annex
|
|
when there's no corresponding key yet). It might be possible to avoid
|
|
git-annex buffering much tree data in memory.
|
|
|
|
----
|
|
|
|
A good content identifier needs to:
|
|
|
|
* Be stable, so when a file has not changed, the content identifier
|
|
remains the same.
|
|
* Change when a file is modified.
|
|
* Be as unique as possible, but not necessarily fully unique.
|
|
A hash of the content would be ideal.
|
|
A (size, mtime, inode) tuple is as good a content identifier as git uses in
|
|
its index.
|
|
|
|
git-annex will need a way to get the content identifiers of files
|
|
that it stores on the remote when exporting a tree to it, so it can later
|
|
know if those files have changed.
|
|
|
|
----
|
|
|
|
The content identifier needs to be stored somehow for later use.
|
|
|
|
It would be good to store the content identifiers only locally, if
|
|
possible.
|
|
|
|
Would local storage pose a problem when multiple repositories import from
|
|
the same remote? In that case, perhaps different trees would be imported,
|
|
and merged into master. So the two repositories then have differing
|
|
masters, which can be reconciled in merge as usual.
|
|
|
|
Since exporttree remotes don't have content identifier information yet, it
|
|
needs to be collected the first time import tree is used. (Or import
|
|
everything, but that is probably too expensive). Any modifications made to
|
|
exported files before the first import tree would not be noticed. Seems
|
|
acceptible as long as this only affects exporttree remotes created before
|
|
this feature was added.
|
|
|
|
What if repo A is being used to import tree from R for a while, and the
|
|
user gets used to editing files on R and importing them. Then they stop
|
|
using A and switch to clone B. It would not have the content identifier
|
|
information that A did. It seems that in this case, B needs to re-download
|
|
everything, to build up the map of content identifiers.
|
|
(Anything could have changed since the last time A imported).
|
|
That seems too expensive!
|
|
|
|
Would storing content identifiers in the git-annex branch be too
|
|
expensive? Probably not.. For S3 with versioning a content identifier is
|
|
already stored. When the content identifier is (mtime, size, inode),
|
|
that's a small amount of data. The maximum size of a content identifier
|
|
could be limited to the size of a typical hash, and if a remote for some
|
|
reason gets something larger, it could simply hash it to generate
|
|
the content identifier.
|
|
|
|
## safety
|
|
|
|
Since the special remote can be written to at any time by something other
|
|
than git-annex, git-annex needs to take care when exporting to it, to avoid
|
|
overwriting such changes.
|
|
|
|
This is similar to how git merge avoids overwriting modified files in the
|
|
working tree.
|
|
|
|
Surprisingly, git merge doesn't avoid overwrites in all conditions! I
|
|
modified git's merge.c to sleep for 10 seconds after `refresh_index()`, and
|
|
verified that changes made to the work tree in that window were silently
|
|
overwritten by git merge. In git's case, the race window is normally quite
|
|
narrow and this is very unlikely to happen.
|
|
|
|
Also, git merge can overwrite a file that a process has open for write;
|
|
the processes's changes then get lost. Verified with
|
|
this perl oneliner, run in a worktree and a second later
|
|
followed by a git pull. The lines that it appended to the
|
|
file got lost:
|
|
|
|
perl -e 'open (OUT, ">>foo") || die "$!"; sleep(10); while (<>) { print OUT $_ }'
|
|
|
|
git-annex should take care to be at least as safe as git merge when
|
|
exporting to a special remote that supports imports.
|
|
|
|
The situations to keep in mind are these:
|
|
|
|
1. File is changed on the remote after an import tree, and an export wants
|
|
to also change it. Need to avoid the export overwriting the
|
|
file. Or, need a way to detect such an overwrite and recover the version
|
|
of the file that got overwritten, after the fact.
|
|
|
|
2. File is changed on the remote while it's being imported, and part of one
|
|
version + part of the other version is downloaded. Need to detect this
|
|
and fail the import.
|
|
|
|
3. File is changed on the remote after its content identifier is checked
|
|
and before it's downloaded, so the wrong version gets downloaded.
|
|
Need to detect this and fail the import.
|
|
|
|
## api design
|
|
|
|
This is an extension to the ExportActions api.
|
|
|
|
listContents :: Annex (ContentHistory [(ExportLocation, ContentIdentifier)])
|
|
|
|
retrieveExportWithContentIdentifier :: ExportLocation -> ContentIdentifier -> (FilePath -> Annex Key) -> MeterUpdate -> Annex (Maybe Key)
|
|
|
|
storeExportWithContentIdentifier :: FilePath -> Key -> ExportLocation -> [ContentIdentifier] -> MeterUpdate -> Annex (Maybe ContentIdentifier)
|
|
|
|
removeExportWithContentIdentifier :: Key -> ExportLocation -> [ContentIdentifier] -> Annex Bool
|
|
|
|
removeExportDirectoryWhenEmpty :: Maybe (ExportDirectory -> Annex Bool)
|
|
|
|
checkPresentExportWithContentIdentifier :: Key -> ExportLocation -> [ContentIdentifier] -> Annex (Maybe Bool)
|
|
|
|
listContents finds the current set of files that are stored in the remote,
|
|
some of which may have been written by other programs than git-annex,
|
|
along with their content identifiers. It returns a list of those, often in
|
|
a single node tree.
|
|
|
|
listContents may also find past versions of files that are stored in the
|
|
remote, when it supports storing multiple versions of files. Since it
|
|
returns a history tree of lists of files, it can represent anything from a
|
|
linear history to a full branching version control history.
|
|
|
|
Note that listContents does not need to worry about generating an
|
|
ExportLocation that contains a ".." attack or an absolute path or other
|
|
such mischief. Since a git tree is built from the ExportLocations, and is
|
|
merged the same as a tree pulled from a regular git remote is,
|
|
git's usual safety measures avoid such attacks.
|
|
|
|
retrieveExportWithContentIdentifier is used when downloading a new file from
|
|
the remote that listContents found. retrieveExport can't be used because
|
|
it has a Key parameter and the key is not yet known in this case.
|
|
(The callback generating a key will let eg S3 record the S3 version id for
|
|
the key.)
|
|
|
|
retrieveExportWithContentIdentifier should detect when the file it's
|
|
downloaded may not match the requested content identifier (eg when
|
|
something else wrote to it while it was being retrieved), and fail
|
|
in that case.
|
|
|
|
When a remote supports imports and exports, storeExport and removeExport
|
|
should not be used when exporting to it, and instead
|
|
storeExportWithContentIdentifier and removeExportWithContentIdentifier
|
|
be used.
|
|
|
|
storeExportWithContentIdentifier stores content and returns the
|
|
content identifier corresponding to what it stored. It can either get
|
|
the content identifier in reply to the store (as S3 does with versioning),
|
|
or it can store to a temp location, get the content identifier of that,
|
|
and then rename the content into place.
|
|
|
|
storeExportWithContentIdentifier must avoid overwriting any existing file
|
|
on the remote, unless the file has the same content identifier that's passed
|
|
to it, to avoid overwriting a file that was modified by something else.
|
|
But alternatively, if listContents can later recover the modified file, it can
|
|
overwrite the modified file.
|
|
|
|
Similarly, removeExportWithContentIdentifier must only remove a file
|
|
on the remote if it has the same content identifier that's passed to it,
|
|
or if listContent can later recover the modified file.
|
|
Otherwise it should fail. (Like removeExport, removeExportWithContentIdentifier
|
|
also succeeds if the file is not present.)
|
|
|
|
Both storeExportWithContentIdentifier and removeExportWithContentIdentifier
|
|
need to handle the case when there's a race with a concurrent writer.
|
|
They can detect such races and fail. Or, if overwritten/deleted modified
|
|
files can later be recovered by listContents, it's acceptable to not detect
|
|
the race.
|
|
|
|
removeExportDirectoryWhenEmpty is used instead of removeExportDirectory.
|
|
It should only remove empty directories, and succeeds if there are files
|
|
in the directory.
|
|
|
|
checkPresentExportWithContentIdentifier is used instead of
|
|
checkPresentExport. It should verify that one of the provided
|
|
ContentIdentifiers matches the current content of the file.
|
|
|
|
Note that renameExport is never used when the special remote supports
|
|
imports, because it may have an implementation that loses changes
|
|
to imported files. (For example, it may copy the file to the new name,
|
|
and delete the old name.)
|
|
|
|
## multiple git-annex repos accessing a special remote
|
|
|
|
If multiple repos can access the remote at the same time, then there's a
|
|
potential problem when one is exporting a new tree, and the other one is
|
|
importing from the remote.
|
|
|
|
This can be reduced to the same problem as exports of two
|
|
different trees to the same remote, which is already handled with the
|
|
export log.
|
|
|
|
Once a tree has been imported from the remote, it's
|
|
in the same state as exporting that same tree to the remote, so
|
|
update the export log to say that the remote has that treeish exported
|
|
to it. A conflict between two export log entries will be handled as
|
|
usual, with the user being prompted to re-export the tree they want
|
|
to be on the remote. (May need to reword that prompt.)
|