When `git annex export treeish` is used to export to a remote, and the
remote allows files to somehow be edited on it, then there ought to be a
way to import the changes back from the remote into the git repository.

The command could be `git annex import treeish` or something like that.

It would ask the special remote to list changed/new files, and deleted
files. Download the changed/new files and inject into the annex.
Generate a new treeish, with parent the treeish that was exported,
that has the modifications in it.

Updating the working copy is then done by merging the import treeish.
This way, conflicts will be detected and handled as normal by git.
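
To sketch that concretely (stand-in types, nothing like git-annex's real
ones): the import folds the remote's reported changes into the tree that was
last exported, and the result is what gets committed with that treeish as its
parent.

    -- Minimal sketch only; FilePath/String stand in for real tree and key types.
    import qualified Data.Map.Strict as M

    type Key = String    -- stand-in for a git-annex key

    data RemoteChanges = RemoteChanges
            { changedOrNew :: [(FilePath, Key)]  -- key known after download+inject
            , deleted      :: [FilePath]
            }

    -- Apply the remote's changes to the tree that was last exported.
    applyImport :: M.Map FilePath Key -> RemoteChanges -> M.Map FilePath Key
    applyImport exported changes =
            foldr M.delete withnew (deleted changes)
      where
            withnew = foldr (uncurry M.insert) exported (changedOrNew changes)
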
----

The remote interface could have a new method, to list the changed/new and
deleted files. It will be up to remotes to implement that if they can
support importing.
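
Roughly, such a method might look like this; the names and types are only
illustrative, not a proposed final interface:

    data Change
            = ChangedOrNew FilePath
            | Deleted FilePath
            deriving Show

    -- Remotes that can support importing would provide this; others simply
    -- would not implement it.
    newtype ImportActions = ImportActions
            { listChanges :: String       -- sha of the last exported treeish
                          -> IO [Change]
            }
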
One way for a remote to do it, assuming it has mtimes, is to export
files to the remote with their mtime set to the date of the treeish
being exported (when the treeish is a commit, which has dates, and not
a raw tree). Then the remote can simply enumerate all files,
with their mtimes, and look for files that have mtimes
newer than the last exported treeish's date.
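
A sketch of that heuristic, assuming the remote can enumerate files with
their mtimes (plain epoch times here):

    import Data.Time.Clock.POSIX (POSIXTime)

    -- Files whose mtime is newer than the date of the last exported treeish
    -- are candidates for import.
    changedSinceExport :: POSIXTime -> [(FilePath, POSIXTime)] -> [FilePath]
    changedSinceExport exportdate files =
            [ f | (f, mtime) <- files, mtime > exportdate ]
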
> But: If files on the remote are being changed at around the time
> of the export, they could have older mtimes than the exported treeish's
> date, and so be missed.
>
> Also, a rename that swaps two files would be missed if mtimes
> are only compared to the treeish's date.

A perhaps better way is for the remote to keep track of the mtime,
size, etc of all exported files, and use that state to find changes.
Where to store that data?

The data could be stored in a file/files on the remote, or perhaps
the remote has a way to store some arbitrary metadata about a file
that could be used.

It could be stored in git-annex branch per-remote state. However,
that state is per-key, not per-file. The export database could be
used to convert an ExportLocation to a Key, which could be used
to access the per-remote state. Querying the database for each file
in the export could be a bottleneck without the right interface.
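
That lookup chain might look roughly like this, with Maps standing in for
the export database and the per-remote state (both of which are really
stored quite differently):

    import qualified Data.Map.Strict as M

    type ExportLocation = FilePath
    type Key = String
    type PerRemoteState = String   -- eg recorded (mtime, size) for the key

    lookupState
            :: M.Map ExportLocation Key   -- export database: location -> key
            -> M.Map Key PerRemoteState   -- per-remote state, keyed by key
            -> ExportLocation
            -> Maybe PerRemoteState
    lookupState exportdb state loc =
            M.lookup loc exportdb >>= \k -> M.lookup k state
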
If only one repository will ever access the remote, it could be stored
in eg a local database. But access from only one repository is a
hard invariant to guarantee.

Would local storage pose a problem when multiple repositories import from
the same remote? In that case, perhaps different trees would be imported,
and merged into master. So the two repositories then have differing
masters, which can be reconciled as usual. It would mean extra downloads
of content from the remote, since each import would download its own copy.
Perhaps this is acceptable?

This feels like it's reimplementing the git index, on a per-remote basis.
So perhaps this is not the right interface.

----

Alternate interface: The remote is responsible for collecting a list of
files currently in it, along with some content identifier. That data is
sent to git-annex. git-annex keeps track of which content identifier(s) map
to which keys, and uses the information to determine when a file on the
remote has changed or is new.
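
A rough sketch of the shape of that, with illustrative types rather than
anything final:

    import qualified Data.Map.Strict as M

    newtype ContentIdentifier = ContentIdentifier String
            deriving (Eq, Ord, Show)

    -- Remote side: just enumerate what is there, with an identifier per file.
    newtype ListContents = ListContents
            { listContents :: IO [(FilePath, ContentIdentifier)] }

    -- git-annex side: remember which content identifiers map to which keys.
    type Key = String
    type KnownContents = M.Map ContentIdentifier Key
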
This way, each special remote doesn't have to reimplement the equivalent of
the git index, or compare lists of files; it only needs a way to list
files, and a good content identifier.

This also simplifies implementation in git-annex, because it does not
even need to look for changed/new/deleted files compared with the
old tree. Instead, it can simply build git tree objects as the file list
comes in, looking up the key corresponding to each content identifier
(or downloading the content from the remote and adding it to the annex
when there's no corresponding key yet). It might be possible to avoid
git-annex buffering much tree data in memory.
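
Per file, the decision could be as simple as this sketch (again with
stand-in types; the download action is a placeholder for downloading from
the remote and injecting into the annex):

    import qualified Data.Map.Strict as M

    newtype ContentIdentifier = ContentIdentifier String deriving (Eq, Ord)
    type Key = String

    importFile
            :: (FilePath -> IO Key)          -- download + inject, yielding a key
            -> M.Map ContentIdentifier Key   -- content identifiers already known
            -> (FilePath, ContentIdentifier)
            -> IO (FilePath, Key)
    importFile download known (f, cid) =
            case M.lookup cid known of
                    Just k  -> return (f, k)
                    Nothing -> do
                            k <- download f
                            return (f, k)
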
----

A good content identifier needs to:

* Be stable, so when a file has not changed, the content identifier
  remains the same.
* Change when a file is modified.
* Be reasonably unique, but not necessarily fully unique.

For example, if the mtime of a file is used as the content identifier, then
a rename that swaps two files would be noticed, except in the
unusual case where they have the same mtime. If a new file
is added with the same mtime as some other file in the tree though,
git-annex will see that the filename is new, and so can still import it,
even though it's seen that content identifier before. Of course, that might
result in unnecessary downloads (eg of a renamed file), so a more unique
content identifier would be better.

A (size, mtime, inode) tuple is as good a content identifier as git uses in
its index. That or a hash of the content would be ideal.
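
For illustration, here is how a remote with POSIX filesystem access (eg the
directory special remote) might derive such an identifier from stat(2); this
is only a sketch, not a proposed implementation:

    import System.Posix.Files (getFileStatus, fileSize, fileID, modificationTimeHiRes)
    import System.Posix.Types (FileOffset, FileID)
    import Data.Time.Clock.POSIX (POSIXTime)

    -- (size, mtime, inode), comparable for equality with what was recorded.
    data ContentIdentifier = ContentIdentifier FileOffset POSIXTime FileID
            deriving (Eq, Show)

    mkContentIdentifier :: FilePath -> IO ContentIdentifier
    mkContentIdentifier f = do
            s <- getFileStatus f
            return (ContentIdentifier (fileSize s) (modificationTimeHiRes s) (fileID s))
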
Do remotes need to tell git-annex about the properties of content
identifiers they use, or does git-annex assume a minimum bar, and pay the
price with some unnecessary transfers of renamed files etc?

git-annex will need a way to get the content identifiers of files
that it stores on the remote when exporting a tree to it, so it can later
know if those files have changed.
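
One possible shape for that, purely illustrative: the method that stores an
exported file hands back the content identifier the remote assigned to it,
so git-annex can record it for later imports.

    newtype ContentIdentifier = ContentIdentifier String
            deriving (Eq, Show)

    newtype ExportActions = ExportActions
            { putExport :: FilePath       -- local content to upload
                        -> FilePath       -- location to store it at on the remote
                        -> IO ContentIdentifier
            }
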
----

## race conditions TODO

There's a race here, since a file could be modified on the remote while
it's being exported, and if the remote then uses the mtime of the modified
file in the content identifier, the modification would never be noticed by
imports.

To fix this race, we need an atomic move operation on the remote. Upload
the file to a temp file, then get its content identifier, and then move it
from the temp file to its final location.
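
For a remote with filesystem access, that sequence might look like the
sketch below, with mtime standing in for the content identifier and copyFile
standing in for the actual upload:

    import System.Directory (copyFile, renameFile)
    import System.Posix.Files (getFileStatus, modificationTimeHiRes)
    import Data.Time.Clock.POSIX (POSIXTime)

    -- Write to a temp file, take the content identifier from the temp file,
    -- then rename into place, so the recorded identifier can never describe
    -- a file that was modified mid-upload.
    exportAtomically :: FilePath -> FilePath -> IO POSIXTime
    exportAtomically src dest = do
            let tmp = dest ++ ".tmp"
            copyFile src tmp                      -- "upload"
            cid <- modificationTimeHiRes <$> getFileStatus tmp
            renameFile tmp dest                   -- atomic move into place
            return cid
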
There's also a race where a file gets changed on the remote after an
import tree, and an export then overwrites it with something else.

One solution would be to only allow one of importtree or exporttree
to a given remote. This reduces the use cases a lot though, and perhaps
so far that the import tree feature is not worth building. The adb
special remote needs both.

Really fixing this race needs locking or an atomic operation. Locking seems
unlikely to be a portable enough solution.

An atomic move could at least narrow the race significantly, eg:

1. upload new version of $file to $tmp1
2. atomic move current $file to $tmp2
3. Get content identifier of $tmp2, and check that it's what it was expected
   to be. If not, $file was modified after the last import tree, and that
   conflict has to be resolved. Otherwise, delete $tmp2
4. atomic move $tmp1 to $file
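
A sketch of those four steps for a filesystem-backed remote, again using
mtime as a stand-in content identifier; the check in step 3 is what turns
the overwrite race into a detectable conflict:

    import System.Directory (copyFile, renameFile, removeFile)
    import System.Posix.Files (getFileStatus, modificationTimeHiRes)
    import Data.Time.Clock.POSIX (POSIXTime)

    data ExportResult = Exported | Conflict FilePath
            deriving Show

    -- Assumes $file already exists on the remote from a previous export.
    exportChecked :: POSIXTime -> FilePath -> FilePath -> IO ExportResult
    exportChecked expectedcid src file = do
            let tmp1 = file ++ ".new"
            let tmp2 = file ++ ".old"
            copyFile src tmp1                    -- 1. upload new version
            renameFile file tmp2                 -- 2. move current file aside
            oldcid <- modificationTimeHiRes <$> getFileStatus tmp2
            if oldcid /= expectedcid
                    then return (Conflict tmp2)  -- 3. modified since last import
                    else do
                            removeFile tmp2      -- 3. as expected; discard it
                            renameFile tmp1 file -- 4. move new version into place
                            return Exported
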
A remaining race is that, if the file is open for write at the same
time it's renamed, the write might happen after the content identifier
is checked, and then whatever is written to it will be lost.

But: Git worktree update has the same race condition. Verified with
this perl oneliner, run in a worktree, followed by a git pull. The lines
that it appended to the file got lost:

    perl -e 'open (OUT, ">>foo") || die "$!"; while (<>) { print OUT $_ }'

Since this is acceptable in git, I suppose we can accept it here too..

Another remaining race is if the file gets recreated after it's moved out
of the way. If the atomic move refuses to overwrite existing files, that race
would be detected by it failing. renameat(2) with `RENAME_NOREPLACE` can do that,
but probably many special remote interfaces don't provide a way to do that.

----

Since exporttree remotes don't have content identifier information yet, it
needs to be collected the first time import tree is used. (Or import
everything, but that is probably too expensive). Any modifications made to
exported files before the first import tree would not be noticed. Seems
acceptable as long as this only affects exporttree remotes created before
this feature was added.

What if repo A is being used to import tree from R for a while, and the
user gets used to editing files on R and importing them. Then they stop
using A and switch to clone B. It would not have the content identifier
information that A did (unless it's stored in git-annex branch rather than
locally). It seems that in this case, B needs to re-download everything,
since anything could have changed since the last time A imported.
That seems too expensive!

Would storing content identifiers in the git-annex branch be too expensive?

----

If multiple repos can access the remote at the same time, then there's a
potential problem when one is exporting a new tree, and the other one is
importing from the remote.

> This can be reduced to the same problem as exports of two
> different trees to the same remote, which is already handled with the
> export log.
>
> Once a tree has been imported from the remote, it's
> in the same state as exporting that same tree to the remote, so
> update the export log to say that the remote has that treeish exported
> to it. A conflict between two export log entries will be handled as
> usual, with the user being prompted to re-export the tree they want
> to be on the remote. (May need to reword that prompt.)
> --[[Joey]]

----

See also, [[adb_special_remote]]