
When `git annex export treeish --to remote` is used to export to a remote,
and the remote allows files to somehow be edited on it, then there ought
to be a way to import the changes back from the remote into the git repository.
The command could be `git annex import --from remote`.

There also ought to be a way to make `git annex sync` automatically import.

See [[design/importing_trees_from_special_remotes]] for current design for
this.

## implementation notes

* updateExportTreeFromLog deadlocks when running git-annex export because
  it locks the export db first.

  Could switch to fine-grained locking, but Command.Export would need to
  lock and flush writes to the database many times, and that may be too
  expensive.

  How about this: Make an action that waits to lock the export db and
  runs updateExportTreeFromLog. While the update is running take an
  exclusive lock on an update lock file. Only lock the database using that,
  in Command.Export etc.

  Then, in ExportImport, it only has to try to run that action;
  if the action fails due to the lock being held by the same or another
  process, it suffices to take a shared lock of the update lock file
  (and immediately release it), in order to wait for the update to
  complete.

* Need to support annex-tracking-branch configuration, which the documentation
  says makes git-annex sync and the assistant do imports.

* git-annex import needs to say when it's downloading files, display
  progress bars, and support concurrent downloads.

* When on an adjusted unlocked branch, need to import the files unlocked.
  Also, the tracking branch code needs to know about such branches,
  currently it will generate the wrong tracking branch.

  The test case for `export_import` currently has a line commented out
  that fails on adjusted unlocked branches.

  Alternatively, could not do anything special for adjusted branches,
  so generating a non-adjusted branch, and require the user to use `git annex
  sync` to merge in that branch. Rationale: After fetching from a normal
  git repo in an adjusted branch, merging does the same thing, and the docs
  say to use `git annex sync` instead. Any improvements to that workflow
  (like eg a way to merge a specified branch and update the adjustment)
  would thus benefit both use cases.

* What if the remote lists importable filenames that are absolute paths,
  or contain a "../" attack? Does git already guard against merging such
  trees?

* Need to support annex.largefiles when importing.

* If a tree containing a non-annexed file is exported,
  and then an import is done from the remote, the new tree will have that
  file annexed, and so merging it converts to annexed (there is no merge
  conflict). This problem seems hard to avoid, other than relying on
  annex.largefiles to tell git-annex if a file should be imported
  non-annexed.

  Although.. The importer could check, for each file, whether there's
  a corresponding file in the branch it's generating the import for,
  and whether that file is annexed. But this might be slow and seems a
  lot of bother for an edge case?
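
On the question above about absolute paths and "../" attacks: whatever git itself does when merging, the importer could refuse such names before it ever builds a tree. A minimal sketch of that check, assuming the remote lists filenames as "/"-separated strings. `is_safe_import_path` is a hypothetical name, not an existing git-annex function, and rejecting `.git` components is an extra precaution not discussed above.

```python
def is_safe_import_path(name: str) -> bool:
    """Return True if name is a relative path that stays inside the tree.

    Hypothetical helper; assumes '/'-separated names from the remote.
    """
    if not name or "\0" in name:
        return False
    # Absolute paths would escape the tree entirely.
    if name.startswith("/"):
        return False
    for part in name.split("/"):
        # Empty components ("a//b", trailing "/"), "." and ".." are
        # never needed in a listing and ".." enables traversal.
        if part in ("", ".", ".."):
            return False
        # Extra precaution (an assumption, see above): never let an
        # import write into a git directory.
        if part.lower() == ".git":
            return False
    return True
```

Names that fail the check could be skipped with a warning, or could make the whole import fail; which is better is left open here.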

## race conditions

(Some thoughts about races that the design should cover now, but kept here
for reference.)

A file could be modified on the remote while
it's being exported, and if the remote then uses the mtime of the modified
file in the content identifier, the modification would never be noticed by
imports.

To fix this race, we need an atomic move operation on the remote. Upload
the file to a temp file, then get its content identifier, and then move it
from the temp file to its final location. Alternatively, upload a file and
get the content identifier atomically, which eg S3 with versioning enabled
provides. It would make sense to have the storeExport operation always return
a content identifier and document that it needs to get it atomically by
either using a temp file or something specific to the remote.
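
A rough sketch of the temp file approach, using a local directory as a stand-in for the remote. Everything here is an assumption made for illustration: `store_export` in this form, `ContentIdentifier`, and the choice of size, mtime and inode as the content identifier are not git-annex's actual interface. It relies on rename leaving those three values unchanged, so the identifier taken from the temp file is still valid for the final file.

```python
import os
import tempfile
from typing import NamedTuple


class ContentIdentifier(NamedTuple):
    # Assumed identifier for a directory-like remote.
    size: int
    mtime_ns: int
    inode: int


def content_identifier(path: str) -> ContentIdentifier:
    st = os.stat(path)
    return ContentIdentifier(st.st_size, st.st_mtime_ns, st.st_ino)


def store_export(remote_dir: str, name: str, data: bytes) -> ContentIdentifier:
    """Store data at name and return the identifier of exactly that data."""
    final = os.path.join(remote_dir, name)
    os.makedirs(os.path.dirname(final), exist_ok=True)
    # Temp file in the same directory, so the rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(final), prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        # Identifier is taken before the file is visible under its final
        # name, so a later edit of the final file yields a different one.
        cid = content_identifier(tmp)
        os.replace(tmp, final)
    except BaseException:
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
    return cid
```

If something modifies the file after the rename, its identifier no longer matches the one returned here, and the next import notices the change.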

----

There's also a race where a file gets changed on the remote after an
import tree, and an export then overwrites it with something else.

One solution would be to only allow one of importtree or exporttree
to a given remote. This reduces the use cases a lot though, and perhaps
so far that the import tree feature is not worth building. The adb
special remote needs both. Also, such a limitation seems like one that
users might try to work around by initializing two remotes using the same
data and trying to use one for import and the other for export.

Really fixing this race needs locking or an atomic operation. Locking seems
unlikely to be a portable enough solution.

An atomic rename operation could at least narrow the race significantly, eg:

1. get content identifier of $file, check if it's what was expected else
   abort (optional but would catch most problems)
2. upload new version of $file to $tmp1
3. rename current $file to $tmp2
4. Get content identifier of $tmp2, check if it's what was expected to
   be. If not, $file was modified after the last import tree, and that
   conflict has to be resolved. Otherwise, delete $tmp2
5. rename $tmp1 to $file

That leaves a race if the file gets overwritten after it's moved out
of the way. If the rename refuses to overwrite existing files, that race
would be detected by it failing. renameat(2) with `RENAME_NOREPLACE` can do that,
but probably many special remote interfaces don't provide a way to do that.
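
The five steps above can be sketched against a local directory standing in for the remote. This is only an illustration under stated assumptions: the names `replace_exported_file` and `ExportConflict` are made up, the content identifier is assumed to be size, mtime and inode, and step 5 uses a hard link followed by an unlink as a portable stand-in for a rename that refuses to overwrite, which only works where the remote supports hard links.

```python
import os


class ExportConflict(Exception):
    """The remote file is not what the last import saw (hypothetical)."""


def content_identifier(path):
    # Assumed identifier: (size, mtime, inode). Rename preserves all three.
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns, st.st_ino)


def replace_exported_file(path, new_data, expected_cid):
    tmp1, tmp2 = path + ".tmp1", path + ".tmp2"
    # 1. Optional early check; catches most conflicts cheaply.
    if content_identifier(path) != expected_cid:
        raise ExportConflict("changed before the export started")
    # 2. Upload the new version to $tmp1.
    with open(tmp1, "wb") as f:
        f.write(new_data)
    try:
        # 3. Move the current file out of the way.
        os.rename(path, tmp2)
        # 4. Verify that what was moved is what the last import saw.
        if content_identifier(tmp2) != expected_cid:
            # The modified content is left in $tmp2 for conflict resolution.
            raise ExportConflict("changed during export; kept as " + tmp2)
        os.unlink(tmp2)
        # 5. Put the new version in place; link() fails if $file was
        #    recreated after step 3, which detects the remaining race.
        try:
            os.link(tmp1, path)
        except FileExistsError:
            raise ExportConflict("file was recreated during the export")
        return content_identifier(path)
    finally:
        os.unlink(tmp1)
```

On a conflict in step 4 or 5 this sketch discards the uploaded new version and leaves resolution to the caller; a real implementation would need to decide what state to leave on the remote.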

S3 lacks a rename operation, can only copy and then delete. Which is not
good enough; it risks the file being replaced with new content before
the delete and the new content being deleted.

Is this race really a significant problem? One way to look at it is
analogous to a git merge overwriting a locally modified file.
Git could certainly use similar techniques to entirely detect and recover
from such races (but not the similar race described in the next section).
But, git does not actually do that! I modified git's
merge.c to sleep for 10 seconds after `refresh_index()`, and verified
that changes made to the work tree in that window were silently overwritten
by git merge. In git's case, the race window is normally quite narrow
and this is very unlikely to happen (the similar race described in the next
section is more likely).

If git-annex could get the race window similarly small it would perhaps be
ok. Eg:

1. upload new version of $file to $tmp
2. get content identifier of $file, check if it's what was expected else
   abort
3. rename (or copy and delete) $tmp to $file

The race window between #2 and #3 could be quite narrow for some remotes.
But S3, lacking a rename, does a copy that can be very slow for large files.

S3, with versioning, could detect the race after the fact, by listing
the versions of the file, and checking if any of the versions is one
that git-annex did not know the file already had.
[Using this api](https://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGETVersion.html),
with version-id-marker set to the previous version of the file,
should list only the previous and current versions; if there's an
intermediate version then the race occurred and it could roll the change
back, or otherwise recover the overwritten version. This could be done at
import time, to detect a previous race, and recover from it; importing
a tree with the file(s) that were overwritten due to the race, leading to a
tree import conflict that the user can resolve. This likely generalizes
to importing a sequence of trees, so each version written to S3 gets
imported.
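
The after-the-fact check could look something like this, once the version listing for the file has been fetched. It is a sketch only: `intermediate_versions` is a hypothetical helper, it takes a plain list of version ids rather than talking to S3, and it assumes the listing is ordered newest first.

```python
def intermediate_versions(listed, previous, current):
    """Return version ids written between previous and current.

    listed:   version ids of one file, assumed newest first
    previous: the version git-annex last knew the file to have
    current:  the version git-annex just wrote

    A non-empty result means something else wrote to the file during
    the race window, and those versions were overwritten by the export.
    Raises ValueError if either id is missing from the listing.
    """
    i_current = listed.index(current)
    i_previous = listed.index(previous)
    return listed[i_current + 1:i_previous]
```

Each id returned is a version that could be imported as its own tree, so the overwritten content shows up as an import conflict rather than being lost.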

----

A remaining race is that, if the file is open for write at the same
time it's renamed, the write might happen after the content identifier
is checked, and then whatever is written to it will be lost.

But: Git worktree update has the same race condition. Verified with
this perl oneliner, run in a worktree and a second later
followed by a git pull. The lines that it appended to the
file got lost:

	perl -e 'open (OUT, ">>foo") || die "$!"; sleep(10); while (<>) { print OUT $_ }'

Since this is acceptable in git, I suppose we can accept it here too..

----

See also, [[adb_special_remote]]