From d128c8c3ec6749f47824f332e1510fe9de737c8c Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Wed, 20 Feb 2019 12:12:32 -0400 Subject: [PATCH] add design document for import tree --- .../exporting_trees_to_special_remotes.mdwn | 6 + .../importing_trees_from_special_remotes.mdwn | 192 ++++++++++++++++++ doc/todo/import_tree.mdwn | 146 +------------ 3 files changed, 203 insertions(+), 141 deletions(-) create mode 100644 doc/design/importing_trees_from_special_remotes.mdwn diff --git a/doc/design/exporting_trees_to_special_remotes.mdwn b/doc/design/exporting_trees_to_special_remotes.mdwn index 6cf7383605..5d6746f5b2 100644 --- a/doc/design/exporting_trees_to_special_remotes.mdwn +++ b/doc/design/exporting_trees_to_special_remotes.mdwn @@ -4,6 +4,12 @@ and content from the tree. (See also [[todo/export]] and [[todo/dumb, unsafe, human-readable_backend]]) +Note that this document was written with the assumption that only git-annex +is writing to the special remote. But +[[importing_trees_from_special_remotes]] invalidates that assumption, +and needed to add some additional things to deal with it. See that link for +details. + [[!toc ]] ## configuring a special remote for tree export diff --git a/doc/design/importing_trees_from_special_remotes.mdwn b/doc/design/importing_trees_from_special_remotes.mdwn new file mode 100644 index 0000000000..904bab1b20 --- /dev/null +++ b/doc/design/importing_trees_from_special_remotes.mdwn @@ -0,0 +1,192 @@ +Importing trees from special remotes allows data published by others to be +gathered. It also combines with [[exporting_trees_to_special_remotes]] +to let a special remote act as a kind of git working tree without `.git`, +that the user can alter data in as they like and use git-annex to pull +their changes into the local repository's version control. + +(See also [[todo/import_tree]].) + +The basic idea is to have a `git annex import --from remote` command. + +It would find changed/new/deleted files on the remote. 
+Download the changed/new files and inject them into the annex.
+Generate a new treeish, whose parent is the treeish that was exported
+earlier, and that has the modifications in it.
+
+Updating the local working copy is then done by merging the import treeish.
+This way, conflicts will be detected and handled as normal by git.
+
+## content identifiers
+
+The remote is responsible for collecting a list of
+files currently in it, along with some content identifier. That data is
+sent to git-annex. git-annex keeps track of which content identifier(s) map
+to which keys, and uses that information to determine when a file on the
+remote has changed or is new.
+
+git-annex can simply build git tree objects as the file list
+comes in, looking up the key corresponding to each content identifier
+(or downloading the content from the remote and adding it to the annex
+when there's no corresponding key yet). It might be possible to avoid
+git-annex buffering much tree data in memory.
+
+----
+
+A good content identifier needs to:
+
+* Be stable, so when a file has not changed, the content identifier
+  remains the same.
+* Change when a file is modified.
+* Be as unique as possible, but not necessarily fully unique.
+  A hash of the content would be ideal.
+  A (size, mtime, inode) tuple is as good a content identifier as git uses
+  in its index.
+
+git-annex will need a way to get the content identifiers of files
+that it stores on the remote when exporting a tree to it, so it can later
+know if those files have changed.
+
+----
+
+The content identifier needs to be stored somehow for later use.
+
+It would be good to store the content identifiers only locally, if
+possible.
+
+Would local storage pose a problem when multiple repositories import from
+the same remote? In that case, perhaps different trees would be imported,
+and merged into master. So the two repositories would then have differing
+masters, which can be reconciled in merge as usual.
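To make the lookup described above concrete, here is a minimal sketch of mapping content identifiers to keys, falling back to a download when no key is known yet. It assumes a (size, mtime)-style identifier; all types and names here are hypothetical simplifications, not git-annex's real ones.

```haskell
import qualified Data.Map as M

-- Hypothetical content identifier, in the (size, mtime) style
-- discussed above. Ord is needed so it can be a Map key.
data ContentIdentifier = ContentIdentifier
    { ciSize :: Integer
    , ciMTime :: Integer
    }
    deriving (Eq, Ord, Show)

-- Stand-in for a git-annex key.
type Key = String

-- Mapping from content identifiers to keys, accumulated from
-- previous imports and exports.
type CidMap = M.Map ContentIdentifier Key

-- When importing, a file whose content identifier is already known
-- maps directly to a key; otherwise its content has to be downloaded
-- and a key generated for it.
data ImportAction = UseKey Key | DownloadAndGenKey
    deriving (Eq, Show)

importAction :: CidMap -> ContentIdentifier -> ImportAction
importAction m cid = maybe DownloadAndGenKey UseKey (M.lookup cid m)
```

With such a lookup, tree objects can be built incrementally as the file list streams in, only pausing to download files whose identifiers are unknown.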
+
+Since exporttree remotes don't have content identifier information yet, it
+needs to be collected the first time import tree is used. (Or import
+everything, but that is probably too expensive.) Any modifications made to
+exported files before the first import tree would not be noticed. That seems
+acceptable as long as it only affects exporttree remotes created before
+this feature was added.
+
+What if repo A is used to import tree from R for a while, and the
+user gets used to editing files on R and importing them, but then stops
+using A and switches to clone B? B would not have the content identifier
+information that A did. It seems that in this case, B needs to re-download
+everything, to build up the map of content identifiers.
+(Anything could have changed since the last time A imported.)
+That seems too expensive!
+
+Would storing content identifiers in the git-annex branch be too
+expensive? Probably not. For S3 with versioning, a content identifier is
+already stored. When the content identifier is (mtime, size, inode),
+that's a small amount of data. The maximum size of a content identifier
+could be limited to the size of a typical hash, and if a remote for some
+reason produces something larger, it could simply be hashed to generate
+the content identifier.
+
+## safety
+
+Since the special remote can be written to at any time by something other
+than git-annex, git-annex needs to take care when exporting to it, to avoid
+overwriting such changes.
+
+This is similar to how git merge avoids overwriting modified files in the
+working tree.
+
+Surprisingly, git merge doesn't avoid overwrites in all conditions! I
+modified git's merge.c to sleep for 10 seconds after `refresh_index()`, and
+verified that changes made to the work tree in that window were silently
+overwritten by git merge. In git's case, the race window is normally quite
+narrow, and this is very unlikely to happen.
+
+Also, git merge can overwrite a file that a process has open for write;
+the process's changes then get lost. Verified with
+this perl one-liner, run in a worktree and followed a second later
+by a git pull. The lines that it appended to the
+file got lost:
+
+    perl -e 'open (OUT, ">>foo") || die "$!"; sleep(10); while (<>) { print OUT $_ }'
+
+git-annex should take care to be at least as safe as git merge when
+exporting to a special remote that supports imports.
+
+The situations to keep in mind are these:
+
+1. A file is changed on the remote after an import tree, and an export
+   wants to also change it. Need to avoid the export overwriting the
+   file. Or, need a way to detect such an overwrite and recover the version
+   of the file that got overwritten, after the fact.
+
+2. A file is changed on the remote while it's being imported, and part of
+   one version + part of the other version is downloaded. Need to detect
+   this and fail the import.
+
+3. A file is changed on the remote after its content identifier is checked
+   and before it's downloaded, so the wrong version gets downloaded.
+   Need to detect this and fail the import.
+
+## api design
+
+This is an extension to the ExportActions api.
+
+    listContents :: Annex (Tree [(ExportLocation, ContentIdentifier)])
+
+    getContentIdentifier :: ExportLocation -> Annex (Maybe ContentIdentifier)
+
+    retrieveExportWithContentIdentifier :: ExportLocation -> ContentIdentifier -> (FilePath -> Annex Key) -> MeterUpdate -> Annex (Maybe Key)
+
+    storeExportWithContentIdentifier :: FilePath -> Key -> ExportLocation -> MeterUpdate -> Annex (Maybe ContentIdentifier)
+
+listContents finds the current set of files that are stored in the remote,
+some of which may have been written by programs other than git-annex,
+along with their content identifiers. It returns a list of those, often in
+a single-node tree.
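The tree shape that listContents returns might look roughly like this sketch, with simplified stand-in types rather than git-annex's real ones:

```haskell
-- A rose tree, standing in for whatever Tree type the api uses.
data Tree a = Node a [Tree a]
    deriving (Eq, Show)

-- Simplified stand-ins for git-annex's types.
type ExportLocation = FilePath
type ContentIdentifier = String
type Listing = [(ExportLocation, ContentIdentifier)]

-- A remote without versioning yields a single-node tree containing
-- only the current files.
currentOnly :: Listing -> Tree Listing
currentOnly l = Node l []

-- A remote that keeps old versions could chain nodes, newest first,
-- to represent a linear history.
linearHistory :: Listing -> [Listing] -> Tree Listing
linearHistory current [] = Node current []
linearHistory current (o:os) = Node current [linearHistory o os]
```

A branching history would simply give a node more than one child.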
+
+listContents may also find past versions of files that are stored in the
+remote, when it supports storing multiple versions of files. Since it
+returns a tree of lists of files, it can represent anything from a linear
+history to a full branching version control history.
+
+retrieveExportWithContentIdentifier is used when downloading a new file from
+the remote that listContents found. retrieveExport can't be used because
+it has a Key parameter, and the key is not yet known in this case.
+(The callback generating a key will let eg S3 record the S3 version id for
+the key.)
+
+retrieveExportWithContentIdentifier should detect when the file it has
+downloaded may not match the requested content identifier (eg when
+something else wrote to it while it was being retrieved), and fail
+in that case.
+
+storeExportWithContentIdentifier stores content and returns the
+content identifier corresponding to what it stored. It can either get
+the content identifier in reply to the store (as S3 does with versioning),
+or it can store to a temp location, get the content identifier of that,
+and then rename the content into place.
+
+storeExportWithContentIdentifier must avoid overwriting any file that may
+have been written to the remote by something else (unless that version of
+the file can later be recovered by listContents), so it will typically
+need to query for the content identifier before moving the new content
+into place. FIXME: How does it know when it's safe to overwrite a file?
+Should it be passed the content identifier that it's allowed to overwrite?
+
+storeExportWithContentIdentifier needs to handle the case where there's a
+race with a concurrent writer. It needs to avoid getting the wrong
+ContentIdentifier for data written by the other writer. It may detect such
+races and fail, or it could succeed and overwrite the other file, so long
+as that file can later be recovered by listContents.
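One possible answer to the FIXME above is for the caller to pass in the content identifier it believes is on the remote, and for the store to refuse when the remote has something different. A pure sketch of that decision (names and shape are hypothetical, not the actual ExportActions api):

```haskell
-- Hypothetical content identifier stand-in.
type ContentIdentifier = String

data StoreDecision
    = Store           -- safe to write the new content
    | RefuseOverwrite -- something else changed the file; fail the store
    deriving (Eq, Show)

safeToStore
    :: Maybe ContentIdentifier -- what git-annex expects to overwrite
                               -- (Nothing = location believed empty)
    -> Maybe ContentIdentifier -- what is currently on the remote
    -> Bool                    -- can listContents recover old versions?
    -> StoreDecision
-- On a versioned remote, an overwritten version remains recoverable,
-- so storing is always allowed.
safeToStore _ _ True = Store
-- Otherwise, only store when the remote still holds exactly what
-- git-annex expected to be there.
safeToStore expected current False
    | expected == current = Store
    | otherwise = RefuseOverwrite
```

This check would still have to be done as close as possible to the rename-into-place, since a writer could sneak in between the check and the rename, as in the git merge race described earlier.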
+ +## multiple git-annex repos accessing a special remote + +If multiple repos can access the remote at the same time, then there's a +potential problem when one is exporting a new tree, and the other one is +importing from the remote. + +This can be reduced to the same problem as exports of two +different trees to the same remote, which is already handled with the +export log. + +Once a tree has been imported from the remote, it's +in the same state as exporting that same tree to the remote, so +update the export log to say that the remote has that treeish exported +to it. A conflict between two export log entries will be handled as +usual, with the user being prompted to re-export the tree they want +to be on the remote. (May need to reword that prompt.) diff --git a/doc/todo/import_tree.mdwn b/doc/todo/import_tree.mdwn index 72e49f112b..d53f0e214a 100644 --- a/doc/todo/import_tree.mdwn +++ b/doc/todo/import_tree.mdwn @@ -3,80 +3,13 @@ and the remote allows files to somehow be edited on it, then there ought to be a way to import the changes back from the remote into the git repository. The command could be `git annex import --from remote` -It would find changed/new/deleted files on the remote. -Download the changed/new files and inject into the annex. -Generate a new treeish, with parent the treeish that was exported, -that has the modifications in it. +See [[design/importing_trees_from_special_remotes]] for current design for +this. -Updating the working copy is then done by merging the import treeish. -This way, conflicts will be detected and handled as normal by git. +## race conditions -## content identifiers - -The remote is responsible for collecting a list of -files currently in it, along with some content identifier. That data is -sent to git-annex. git-annex keeps track of which content identifier(s) map -to which keys, and uses the information to determine when a file on the -remote has changed or is new. 
- -git-annex can simply build git tree objects as the file list -comes in, looking up the key corresponding to each content identifier -(or downloading the content from the remote and adding it to the annex -when there's no corresponding key yet). It might be possible to avoid -git-annex buffering much tree data in memory. - ----- - -A good content identifier needs to: - -* Be stable, so when a file has not changed, the content identifier - remains the same. -* Change when a file is modified. -* Be as unique as possible, but not necessarily fully unique. - A hash of the content would be ideal. - A (size, mtime, inode) tuple is as good a content identifier as git uses in - its index. - -git-annex will need a way to get the content identifiers of files -that it stores on the remote when exporting a tree to it, so it can later -know if those files have changed. - ----- - -The content identifier needs to be stored somehow for later use. - -It would be good to store the content identifiers only locally, if -possible. - -Would local storage pose a problem when multiple repositories import from -the same remote? In that case, perhaps different trees would be imported, -and merged into master. So the two repositories then have differing -masters, which can be reconciled in merge as usual. - -Since exporttree remotes don't have content identifier information yet, it -needs to be collected the first time import tree is used. (Or import -everything, but that is probably too expensive). Any modifications made to -exported files before the first import tree would not be noticed. Seems -acceptible as long as this only affects exporttree remotes created before -this feature was added. - -What if repo A is being used to import tree from R for a while, and the -user gets used to editing files on R and importing them. Then they stop -using A and switch to clone B. It would not have the content identifier -information that A did. 
It seems that in this case, B needs to re-download -everything, to build up the map of content identifiers. -(Anything could have changed since the last time A imported). -That seems too expensive! - -Would storing content identifiers in the git-annex branch be too -expensive? Probably not.. For S3 with versioning a content identifier is -already stored. When the content identifier is (mtime, size, inode), -that's a small amount of data. The maximum size of a content identifier -could be limited to the size of a typical hash, and if a remote for some -reason gets something larger, it could simply hash it to generate -the content identifier. - -## race conditions TODO +(Some thoughts about races that the design should cover now, but kept here +for reference.) A file could be modified on the remote while it's being exported, and if the remote then uses the mtime of the modified @@ -179,73 +112,4 @@ Since this is acceptable in git, I suppose we can accept it here too.. ---- -If multiple repos can access the remote at the same time, then there's a -potential problem when one is exporting a new tree, and the other one is -importing from the remote. - -> This can be reduced to the same problem as exports of two -> different trees to the same remote, which is already handled with the -> export log. -> -> Once a tree has been imported from the remote, it's -> in the same state as exporting that same tree to the remote, so -> update the export log to say that the remote has that treeish exported -> to it. A conflict between two export log entries will be handled as -> usual, with the user being prompted to re-export the tree they want -> to be on the remote. (May need to reword that prompt.) -> --[[Joey]] - -## api design - -Pulling all of the above together, this is an extension to the -ExportActions api. 
- - listContents :: Annex (Tree [(ExportLocation, ContentIdentifier)]) - - getContentIdentifier :: ExportLocation -> Annex (Maybe ContentIdentifier) - - retrieveExportWithContentIdentifier :: ExportLocation -> ContentIdentifier -> (FilePath -> Annex Key) -> MeterUpdate -> Annex (Maybe Key) - - storeExportWithContentIdentifier :: FilePath -> Key -> ExportLocation -> MeterUpdate -> Annex (Maybe ContentIdentifier) - -listContents finds the current set of files that are stored in the remote, -some of which may have been written by other programs than git-annex, -along with their content identifiers. It returns a list of those, often in -a single node tree. - -listContents may also find past versions of files that are stored in the -remote, when it supports storing multiple versions of files. Since it -returns a tree of lists of files, it can represent anything from a linear -history to a full branching version control history. - -retrieveExportWithContentIdentifier is used when downloading a new file from -the remote that listContents found. retrieveExport can't be used because -it has a Key parameter and the key is not yet known in this case. -(The callback generating a key will let eg S3 record the S3 version id for -the key.) - -retrieveExportWithContentIdentifier should detect when the file it's -downloaded may not match the requested content identifier (eg when -something else wrote to it), and fail in that case. - -storeExportWithContentIdentifier is used to get the content identifier -corresponding to what it stores. It can either get the content -identifier in reply to the store (as S3 does with versioning), or it can -store to a temp location, get the content identifier of that, and then -rename the content into place. 
- -storeExportWithContentIdentifier must avoid overwriting any file that may -have been written to the remote by something else (unless that version of -the file can later be recovered by listContents), so it will typically -need to query for the content identifier before moving the new content -into place. - -storeExportWithContentIdentifier needs to handle the case when there's a -race with a concurrent writer. It needs to avoid getting the wrong -ContentIdentifier for data written by the other writer. It may detect such -races and fail, or it could succeed and overwrite the other file, so long -as it can later be recovered by listContents. - ----- - See also, [[adb_special_remote]]