git-annex/doc/todo/import_tree.mdwn

When `git annex export treeish` is used to export to a remote, and the
remote allows files to somehow be edited on it, then there ought to be a
way to import the changes back from the remote into the git repository.

The command could be `git annex import treeish` or something like that.

It would ask the special remote to list changed/new files, and deleted
files. Download the changed/new files and inject into the annex. 
Generate a new treeish, with parent the treeish that was exported,
that has the modifications in it.

Updating the working copy is then done by merging the import treeish.
This way, conflicts will be detected and handled as normal by git.

----

The remote interface could have a new method, to list the changed/new and
deleted files. It will be up to remotes to implement that if they can
support importing.

One way for a remote to do it, assuming it has mtimes, is to export
files to the remote with their mtime set to the date of the treeish
being exported (when the treeish is a commit, which has dates, and not
a raw tree). Then the remote can simply enumerate all files,
with their mtimes, and look for files that have mtimes
newer than the last exported treeish's date.

> But: If files on the remote are being changed at around the time
> of the export, they could have older mtimes than the exported treeish's
> date, and so be missed.
> 
> Also, a rename that swaps two files would be missed if mtimes
> are only compared to the treeish's date.

A perhaps better way is for the remote to keep track of the mtime,
size, etc of all exported files, and use that state to find changes.
Where to store that data?

The data could be stored in a file/files on the remote, or perhaps
the remote has a way to store some arbitrary metadata about a file
that could be used.

It could be stored in git-annex branch per-remote state. However,
that state is per-key, not per-file. The export database could be
used to convert a ExportLocation to a Key, which could be used
to access the per-remote state. Querying the database for each file
in the export could be a bottleneck without the right interface.

If only one repository will ever access the remote, it could be stored
in eg a local database. But access from only one repository is a 
hard invariant to guarantee.

Would local storage pose a problem when multiple repositories import from
the same remote? In that case, perhaps different trees would be imported,
and merged into master. So the two repositories then have differing
masters, which can be reconciled as usual. It would mean extra downloads
of content from the remote, since each import would download its own copy.
Perhaps this is acceptable?

This feels like it's reimplementing the git index, on a per-remote basis.
So perhaps this is not the right interface.

----

Alternate interface: The remote is responsible for collecting a list of
files currently in it, along with some content identifier. That data is
sent to git-annex. git-annex keep track of which content identifier(s) map
to which keys, and uses the information to determine when a file on the
remote has changed or is new.

This way, each special remote doesn't have to reimplement the equivilant of
the git index, or comparing lists of files, it only needs a way to list
files, and a good content identifier.

This also simplifies implementation in git-annex, because it does not
even need to look for changed/new/deleted files compared with the
old tree. Instead, it can simply build git tree objects as the file list
comes in, looking up the key corresponding to each content identifier
(or downloading the content from the remote and adding it to the annex
when there's no corresponding key yet). It might be possible to avoid
git-annex buffering much tree data in memory.

----

A good content identifier needs to:

* Be stable, so when a file has not changed, the content identifier
  remains the same.
* Change when a file is modified.
* Be reasonably unique, but not necessarily fully unique.  
  For example, if the mtime of a file is used as the content identifier, then
  a rename that swaps two files would be noticed, except for in the
  unusual case where they have the same mtime. If a new file
  is added with the same mtime as some other file in the tree though,
  git-annex will see that the filename is new, and so can still import it,
  even though it's seen that content identifier before. Of course, that might
  result in unncessary downloads (eg of a renamed file), so a more unique
  content identifer would be better.

A (size, mtime, inode) tuple is as good a content identifier as git uses in
its index. That or a hash of the content would be ideal. 

Do remotes need to tell git-annex about the properties of content
identifiers they use, or does git-annex assume a minimum bar, and pay the
price with some unncessary transfers of renamed files etc?

git-annex will need a way to get the content identifiers of files
that it stores on the remote when exporting a tree to it, so it can later
know if those files have changed.

----

## race conditions TODO

There's a race here, since a file could be modified on the remote while
it's being exported, and if the remote then uses the mtime of the modified
file in the content identifier, the modification would never be noticed by
imports. 

To fix this race, we need an atomic move operation on the remote. Upload
the file to a temp file, then get its content identifier, and then move it
from the temp file to its final location.

There's also a race where a file gets changed on the remote after an
import tree, and an export then overwrites it with something else.

One solution would be to only allow one of importtree or exporttree
to a given remote. This reduces the use cases a lot though, and perhaps
so far that the import tree feature is not worth building. The adb
special remote needs both.

Really fixing this race needs locking or an atomic operation. Locking seems
unlikely to be a portable enough solution.

An atomic move could at least narrow the race significantly, eg:

1. upload new version of $file to $tmp1
2. atomic move current $file to $tmp2
3. Get content identifier of $tmp2, check if it's what was expected to
   be. If not, $file was modified after the last import tree, and that
   conflict has to be resolved. Otherwise, delete $tmp2
4. atomic move $tmp1 to $file

A remaining race is that, if the file is open for write at the same
time it's renamed, the write might happen after the content identifer
is checked, and then whatever is written to it will be lost. 

But: Git worktree update has the same race condition. Verified with
this perl oneliner, run in a worktree, followed by a git pull. The lines
that it appended to the file got lost:

	perl -e 'open (OUT, ">>foo") || die "$!"; while (<>) { print OUT $_ }'

Since this is acceptable in git, I suppose we can accept it here too..

Another remaining race is if the file gets recreated after it's moved out
of the way. If the atomic move refuses to overwrite existing files, that race
would be detected by it failing. renameat(2) with `RENAME_NOREPLACE` can do that, 
but probably many special remote interfaces don't provide a way to do that.

----

Since exporttree remotes don't have content identifier information yet, it
needs to be collected the first time import tree is used. (Or import
everything, but that is probably too expensive). Any modifications made to
exported files before the first import tree would not be noticed. Seems
acceptible as long as this only affects exporttree remotes created before
this feature was added.

What if repo A is being used to import tree from R for a while, and the
user gets used to editing files on R and importing them. Then they stop
using A and switch to clone B. It would not have the content identifier
information that A did (unless it's stored in git-annex branch rather than
locally). It seems that in this case, B needs to re-download everything,
since anything could have changed since the last time A imported.
That seems too expensive!

Would storing content identifiers in the git-annex branch be too expensive?

----

If multiple repos can access the remote at the same time, then there's a
potential problem when one is exporting a new tree, and the other one is
importing from the remote.

> This can be reduced to the same problem as exports of two
> different trees to the same remote, which is already handled with the
> export log.
> 
> Once a tree has been imported from the remote, it's
> in the same state as exporting that same tree to the remote, so
> update the export log to say that the remote has that treeish exported
> to it. A conflict between two export log entries will be handled as
> usual, with the user being prompted to re-export the tree they want
> to be on the remote. (May need to reword that prompt.)
> --[[Joey]]

----

See also, [[adb_special_remote]]
ideas 2018-03-21 06:25:53 +00:00			When `git annex export treeish` is used to export to a remote, and the
			`remote allows files to somehow be edited on it, then there ought to be a`
			`way to import the changes back from the remote into the git repository.`

			The command could be `git annex import treeish` or something like that.

			`It would ask the special remote to list changed/new files, and deleted`
			`files. Download the changed/new files and inject into the annex.`
			`Generate a new treeish, with parent the treeish that was exported,`
			`that has the modifications in it.`

			`Updating the working copy is then done by merging the import treeish.`
			`This way, conflicts will be detected and handled as normal by git.`

improve wording 2018-06-14 21:14:13 +00:00			`----`

			`The remote interface could have a new method, to list the changed/new and`
ideas 2018-03-21 06:25:53 +00:00			`deleted files. It will be up to remotes to implement that if they can`
			`support importing.`

			`One way for a remote to do it, assuming it has mtimes, is to export`
			`files to the remote with their mtime set to the date of the treeish`
			`being exported (when the treeish is a commit, which has dates, and not`
			`a raw tree). Then the remote can simply enumerate all files,`
			`with their mtimes, and look for files that have mtimes`
more thoughts 2018-06-14 17:30:34 +00:00			`newer than the last exported treeish's date.`
ideas 2018-03-21 06:25:53 +00:00
more thoughts 2018-06-14 17:30:34 +00:00			`> But: If files on the remote are being changed at around the time`
			`> of the export, they could have older mtimes than the exported treeish's`
			`> date, and so be missed.`
			`>`
			`> Also, a rename that swaps two files would be missed if mtimes`
			`> are only compared to the treeish's date.`

			`A perhaps better way is for the remote to keep track of the mtime,`
			`size, etc of all exported files, and use that state to find changes.`
			`Where to store that data?`

			`The data could be stored in a file/files on the remote, or perhaps`
			`the remote has a way to store some arbitrary metadata about a file`
improve wording 2018-06-14 21:14:13 +00:00			`that could be used.`
more thoughts 2018-06-14 17:30:34 +00:00
			`It could be stored in git-annex branch per-remote state. However,`
			`that state is per-key, not per-file. The export database could be`
			`used to convert a ExportLocation to a Key, which could be used`
			`to access the per-remote state. Querying the database for each file`
			`in the export could be a bottleneck without the right interface.`

			`If only one repository will ever access the remote, it could be stored`
			`in eg a local database. But access from only one repository is a`
			`hard invariant to guarantee.`

			`Would local storage pose a problem when multiple repositories import from`
			`the same remote? In that case, perhaps different trees would be imported,`
			`and merged into master. So the two repositories then have differing`
			`masters, which can be reconciled as usual. It would mean extra downloads`
			`of content from the remote, since each import would download its own copy.`
			`Perhaps this is acceptable?`

improve wording 2018-06-14 21:14:13 +00:00			`This feels like it's reimplementing the git index, on a per-remote basis.`
			`So perhaps this is not the right interface.`

more thoughts 2018-06-14 17:30:34 +00:00			`----`

improve wording 2018-06-14 21:14:13 +00:00			`Alternate interface: The remote is responsible for collecting a list of`
			`files currently in it, along with some content identifier. That data is`
			`sent to git-annex. git-annex keep track of which content identifier(s) map`
			`to which keys, and uses the information to determine when a file on the`
			`remote has changed or is new.`
more thoughts 2018-06-14 17:30:34 +00:00
			`This way, each special remote doesn't have to reimplement the equivilant of`
			`the git index, or comparing lists of files, it only needs a way to list`
			`files, and a good content identifier.`

improve wording 2018-06-14 21:14:13 +00:00			`This also simplifies implementation in git-annex, because it does not`
			`even need to look for changed/new/deleted files compared with the`
			`old tree. Instead, it can simply build git tree objects as the file list`
			`comes in, looking up the key corresponding to each content identifier`
			`(or downloading the content from the remote and adding it to the annex`
			`when there's no corresponding key yet). It might be possible to avoid`
			`git-annex buffering much tree data in memory.`

			`----`

more thoughts 2018-06-14 17:30:34 +00:00			`A good content identifier needs to:`

			`* Be stable, so when a file has not changed, the content identifier`
			`remains the same.`
			`* Change when a file is modified.`
			`* Be reasonably unique, but not necessarily fully unique.`
			`For example, if the mtime of a file is used as the content identifier, then`
			`a rename that swaps two files would be noticed, except for in the`
some open questions 2018-06-14 17:42:25 +00:00			`unusual case where they have the same mtime. If a new file`
more thoughts 2018-06-14 17:30:34 +00:00			`is added with the same mtime as some other file in the tree though,`
some open questions 2018-06-14 17:42:25 +00:00			`git-annex will see that the filename is new, and so can still import it,`
			`even though it's seen that content identifier before. Of course, that might`
			`result in unncessary downloads (eg of a renamed file), so a more unique`
			`content identifer would be better.`
thoughts 2018-06-14 16:32:18 +00:00
more thoughts 2018-06-14 17:30:34 +00:00			`A (size, mtime, inode) tuple is as good a content identifier as git uses in`
some open questions 2018-06-14 17:42:25 +00:00			`its index. That or a hash of the content would be ideal.`

			`Do remotes need to tell git-annex about the properties of content`
			`identifiers they use, or does git-annex assume a minimum bar, and pay the`
			`price with some unncessary transfers of renamed files etc?`

improve wording 2018-06-14 21:14:13 +00:00			`git-annex will need a way to get the content identifiers of files`
			`that it stores on the remote when exporting a tree to it, so it can later`
			`know if those files have changed.`
more thoughts 2018-06-14 18:00:49 +00:00
dealing with race conditions in import tree design I seem to be down to a race no worse than one in git, which seems good enough. This commit was sponsored by Trenton Cronholm on Patreon. 2018-07-09 18:05:34 +00:00			`----`

			`## race conditions TODO`

improve wording 2018-06-14 21:14:13 +00:00			`There's a race here, since a file could be modified on the remote while`
dealing with race conditions in import tree design I seem to be down to a race no worse than one in git, which seems good enough. This commit was sponsored by Trenton Cronholm on Patreon. 2018-07-09 18:05:34 +00:00			`it's being exported, and if the remote then uses the mtime of the modified`
			`file in the content identifier, the modification would never be noticed by`
			`imports.`
some open questions 2018-06-14 17:42:25 +00:00
dealing with race conditions in import tree design I seem to be down to a race no worse than one in git, which seems good enough. This commit was sponsored by Trenton Cronholm on Patreon. 2018-07-09 18:05:34 +00:00			`To fix this race, we need an atomic move operation on the remote. Upload`
			`the file to a temp file, then get its content identifier, and then move it`
			`from the temp file to its final location.`
more thoughts 2018-06-14 18:00:49 +00:00
improve wording 2018-06-14 21:14:13 +00:00			`There's also a race where a file gets changed on the remote after an`
dealing with race conditions in import tree design I seem to be down to a race no worse than one in git, which seems good enough. This commit was sponsored by Trenton Cronholm on Patreon. 2018-07-09 18:05:34 +00:00			`import tree, and an export then overwrites it with something else.`

			`One solution would be to only allow one of importtree or exporttree`
			`to a given remote. This reduces the use cases a lot though, and perhaps`
			`so far that the import tree feature is not worth building. The adb`
			`special remote needs both.`

			`Really fixing this race needs locking or an atomic operation. Locking seems`
			`unlikely to be a portable enough solution.`

			`An atomic move could at least narrow the race significantly, eg:`

			`1. upload new version of $file to $tmp1`
			`2. atomic move current $file to $tmp2`
			`3. Get content identifier of $tmp2, check if it's what was expected to`
			`be. If not, $file was modified after the last import tree, and that`
			`conflict has to be resolved. Otherwise, delete $tmp2`
			`4. atomic move $tmp1 to $file`

thought 2018-07-09 18:38:34 +00:00			`A remaining race is that, if the file is open for write at the same`
dealing with race conditions in import tree design I seem to be down to a race no worse than one in git, which seems good enough. This commit was sponsored by Trenton Cronholm on Patreon. 2018-07-09 18:05:34 +00:00			`time it's renamed, the write might happen after the content identifer`
			`is checked, and then whatever is written to it will be lost.`

			`But: Git worktree update has the same race condition. Verified with`
			`this perl oneliner, run in a worktree, followed by a git pull. The lines`
			`that it appended to the file got lost:`

			`perl -e 'open (OUT, ">>foo") \|\| die "$!"; while (<>) { print OUT $_ }'`

			`Since this is acceptable in git, I suppose we can accept it here too..`
improve wording 2018-06-14 21:14:13 +00:00
thought 2018-07-09 18:38:34 +00:00			`Another remaining race is if the file gets recreated after it's moved out`
			`of the way. If the atomic move refuses to overwrite existing files, that race`
			would be detected by it failing. renameat(2) with `RENAME_NOREPLACE` can do that,
			`but probably many special remote interfaces don't provide a way to do that.`

improve wording 2018-06-14 21:14:13 +00:00			`----`

			`Since exporttree remotes don't have content identifier information yet, it`
			`needs to be collected the first time import tree is used. (Or import`
			`everything, but that is probably too expensive). Any modifications made to`
			`exported files before the first import tree would not be noticed. Seems`
			`acceptible as long as this only affects exporttree remotes created before`
			`this feature was added.`
more thoughts 2018-06-14 18:00:49 +00:00
			`What if repo A is being used to import tree from R for a while, and the`
			`user gets used to editing files on R and importing them. Then they stop`
			`using A and switch to clone B. It would not have the content identifier`
			`information that A did (unless it's stored in git-annex branch rather than`
			`locally). It seems that in this case, B needs to re-download everything,`
			`since anything could have changed since the last time A imported.`
improve wording 2018-06-14 21:14:13 +00:00			`That seems too expensive!`

more thoughts 2018-06-14 18:00:49 +00:00			`Would storing content identifiers in the git-annex branch be too expensive?`
ideas 2018-03-21 06:25:53 +00:00
update 2018-03-21 13:19:06 +00:00			`----`

			`If multiple repos can access the remote at the same time, then there's a`
			`potential problem when one is exporting a new tree, and the other one is`
			`importing from the remote.`

thoughts 2018-06-14 16:32:18 +00:00			`> This can be reduced to the same problem as exports of two`
			`> different trees to the same remote, which is already handled with the`
			`> export log.`
			`>`
			`> Once a tree has been imported from the remote, it's`
			`> in the same state as exporting that same tree to the remote, so`
			`> update the export log to say that the remote has that treeish exported`
			`> to it. A conflict between two export log entries will be handled as`
			`> usual, with the user being prompted to re-export the tree they want`
			`> to be on the remote. (May need to reword that prompt.)`
			`> --[[Joey]]`

update 2018-03-21 13:19:06 +00:00			`----`

ideas 2018-03-21 06:25:53 +00:00			`See also, [[adb_special_remote]]`