starting api design

Joey Hess 2019-02-11 15:47:18 -04:00
parent b7991248db
commit 87987c78cf


Updating the working copy is then done by merging the import treeish.
This way, conflicts will be detected and handled as normal by git.
----
## content identifiers
The remote is responsible for collecting a list of the files currently in
it, along with a content identifier for each.
What if two repositories import a tree from the same remote? In that case,
perhaps different trees would be imported, and merged into master. So the
two repositories then have differing masters, which can be reconciled by
merging as usual.
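The comparison of a recorded listing against a fresh one could be sketched
like this. All the type definitions here are hypothetical stand-ins for
illustration, not git-annex's real types:

```haskell
import qualified Data.Map.Strict as M

-- Hypothetical stand-ins for the real types.
type ExportLocation = FilePath
newtype ContentIdentifier = ContentIdentifier String
	deriving (Eq, Show)

-- Compare the content identifiers recorded at the last import with a
-- fresh listing from the remote; any file that is new, or whose
-- identifier changed, needs to be imported.
changedFiles
	:: M.Map ExportLocation ContentIdentifier  -- recorded at last import
	-> [(ExportLocation, ContentIdentifier)]   -- fresh listing
	-> [ExportLocation]
changedFiles known = map fst . filter changed
  where
	changed (loc, cid) = M.lookup loc known /= Just cid
```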
----
Since exporttree remotes don't have content identifier information yet, it
needs to be collected the first time import tree is used. (Or import
everything, but that is probably too expensive.) Any modifications made to
exported files before the first import tree would not be noticed. That seems
acceptable as long as it only affects exporttree remotes created before
this feature was added.

What if repo A is used to import tree from R for a while, and the user gets
used to editing files on R and importing them, but then stops using A and
switches to clone B? B would not have the content identifier information
that A did. It seems that in this case, B needs to re-download everything,
to build up the map of content identifiers
(anything could have changed since the last time A imported).
That seems too expensive!

Would storing content identifiers in the git-annex branch be too
expensive? Probably not. For S3 with versioning, a content identifier is
already stored. When the content identifier is (mtime, size, inode),
that's a small amount of data. The maximum size of a content identifier
could be limited to the size of a typical hash, and if a remote for some
reason produces something larger, it could simply be hashed to generate
the content identifier.
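That size cap could look something like the following. FNV-1a is used here
only to keep the sketch dependency-free; it stands in for whatever real
hash would actually be chosen, and the names are made up:

```haskell
import Data.Bits (xor)
import Data.Char (ord)
import Data.Word (Word64)
import Numeric (showHex)

-- Cap content identifiers at roughly the size of a typical hash.
maxIdentifierLen :: Int
maxIdentifierLen = 64

-- FNV-1a, a stand-in hash for illustration only.
fnv1a :: String -> Word64
fnv1a = foldl step 14695981039346656037
  where
	step h c = (h `xor` fromIntegral (ord c)) * 1099511628211

-- Pass small identifiers through unchanged; hash oversized ones.
capIdentifier :: String -> String
capIdentifier cid
	| length cid <= maxIdentifierLen = cid
	| otherwise = showHex (fnv1a cid) ""
```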
## race conditions

TODO
If multiple repos can access the remote at the same time, then there's a
potential problem when one is exporting a new tree, and the other one is
importing from the remote.
> to be on the remote. (May need to reword that prompt.)
> --[[Joey]]
## api design
Pulling all of the above together, this is an extension to the
ExportActions api.
    listContents :: Annex [(ExportLocation, ContentIdentifier)]

    getContentIdentifier :: ExportLocation -> Annex (Maybe ContentIdentifier)

    retrieveExportWithContentIdentifier :: ExportLocation -> ContentIdentifier -> FilePath -> MeterUpdate -> Annex Bool

    storeExportWithContentIdentifier :: FilePath -> Key -> ExportLocation -> MeterUpdate -> Annex (Maybe ContentIdentifier)
retrieveExportWithContentIdentifier is used when downloading a new file from
the remote that listContents found. retrieveExport can't be used because
it has a Key parameter and the key is not yet known in this case.
storeExportWithContentIdentifier is used to get the content identifier
corresponding to what was just stored. It can either get the content
identifier in reply to the store (as S3 does with versioning), or it can
store to a temp location, get the content identifier of that, and then
rename the content into place. When there's a race with a concurrent
writer, it needs to avoid getting the ContentIdentifier for data written by
the other writer.
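The store-to-temp-then-rename variant could be sketched like this, assuming
a filesystem-like remote where rename is atomic; the function and parameter
names are invented for the sketch:

```haskell
import System.Directory (getFileSize, getModificationTime, renameFile)

-- Store content via a temp file, reading back an (mtime, size) content
-- identifier from the temp file *before* renaming it into place, so a
-- concurrent writer to the final location cannot influence the
-- identifier that gets recorded.
storeViaTemp :: FilePath -> FilePath -> String -> IO String
storeViaTemp tmpfile destfile content = do
	writeFile tmpfile content
	-- Only this process writes to tmpfile, so stat-ing it is race-free.
	mtime <- getModificationTime tmpfile
	size <- getFileSize tmpfile
	renameFile tmpfile destfile
	return (show (mtime, size))
```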
TODO what's needed to work around the other race condition discussed above?
----
See also [[adb_special_remote]].