thoughts on handling renames efficiently

This gets complicated, but I think this design will work! This commit was supported by the NSF-funded DataLad project.
2017-09-06 13:04:09 -04:00
commit 1ec3a9eb05
parent a1cc9ec0fd
2 changed files with 39 additions and 9 deletions
--- a/doc/design/exporting_trees_to_special_remotes.mdwn
+++ b/doc/design/exporting_trees_to_special_remotes.mdwn
@@ -237,11 +237,37 @@ for the current treeish. (Unless a conflicting export was made from
 elsewhere, but in that case, the conflict resolution will have to fix up
 later.)

-Efficient resuming can then first check if the location log says the
-export contains the content. (If not, transfer a copy.) If the location
-log says the export contains the content, use CHECKPRESENTEXPORT to see if
-the file exists, and if not transfer a copy. The CHECKPRESENTEXPORT check
-deals with the case where the treeish has two files with the same content.
-If we have a key-to-files map for the export, then we can skip the 
-CHECKPRESENTEXPORT check when there's only one file using a key. So,
-resuming can be quite efficient.
+## handling renames efficiently
+
+To handle two files that swap names, a temp name is required.
+
+Difficulty with a temp name is picking a name that won't ever be used by
+any exported file.
+
+Interrupted exports also complicate this. While a name could be picked that
+is in neither the old nor the new tree, an export could be interrupted,
+leaving the file at the temp name. There needs to be something to clean
+that up when the export is resumed, even if it's resumed with a different 
+tree.
+
+Could use something like ".git-annex-tmp-content-$key" as the temp name.
+This hides it from casual view, which is good, and it's not dependent on the
+tree, so no state needs to be maintained to clean it up. Also, using the
+key in the name simplifies calculation of complicated renames (eg, renaming
+A to B, B to C, C to A).
+
+Export can first try to rename the temp name of each key whose file is
+added in the diff to that file's name in the new tree, and then delete the
+temp name of each key whose file is removed in the diff. That is more
+renames and deletes than strictly necessary, but it will statelessly clean
+up an interrupted export as long as it's run again with the same new tree.
+
+But, an export of tree B should clean up after
+an interrupted export of tree A. Some state is needed to handle this.
+Before starting the export of tree A, record it somewhere. Then when
+resuming, diff A..B, and rename/delete the temp names of the keys in the
+diff. Also diff from the last fully exported tree to B and do
+the same rename/delete.
+
+So, before an export does anything, need to record the tree that's about
+to be exported to export.log, not as an exported tree, but as a goal.
--- a/doc/todo/export.mdwn
+++ b/doc/todo/export.mdwn
@@ -19,7 +19,11 @@ Work is in progress. Todo list:

 * `git annex get --from export` works in the repo that exported to it,
  but in another repo, the export db won't be populated, so it won't work.
-  Maybe just show a useful error message in this case?
+  Maybe just show a useful error message in this case?  
+  However, exporting from one repository and then trying to update the
+  export from another repository also doesn't work right, because the
+  export database is not populated. So, seems that the export database needs
+  to get populated based on the export log in these cases.
 * Efficient handling of renames.
 * Support export to additional special remotes (S3 etc)
 * Support export to external special remotes.
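
The rename handling described in the design doc above can be sketched roughly as follows. This is only an illustration in Python, not git-annex code: `plan_export`, `apply_ops`, and the `(action, source, destination)` tuples are hypothetical names, and trees are modelled as plain dicts from exported path to key. Only the ".git-annex-tmp-content-$key" naming comes from the design notes. Every changed file is first parked at its key's temp name, then moved to its new name, so swaps and cycles need no extra bookkeeping, at the cost of more renames than strictly necessary.

```python
from typing import Dict, List, Tuple

Tree = Dict[str, str]        # exported path -> key (assumed model of a tree)
Op = Tuple[str, str, str]    # (action, source, destination); hypothetical format

TMP_PREFIX = ".git-annex-tmp-content-"


def temp_name(key: str) -> str:
    """Temp name that depends only on the key, not on any tree."""
    return TMP_PREFIX + key


def plan_export(old: Tree, new: Tree) -> List[Op]:
    """Plan remote operations that turn an export of `old` into `new`."""
    ops: List[Op] = []
    parked = set()  # keys whose content currently sits at its temp name

    # Phase 1: park every removed or changed file at its key's temp name.
    for path, key in sorted(old.items()):
        if new.get(path) == key:
            continue  # unchanged file, leave it alone
        if key in parked:
            ops.append(("delete", path, ""))  # duplicate content, one copy is enough
        else:
            ops.append(("rename", path, temp_name(key)))
            parked.add(key)

    # Phase 2: move parked content to its new name, uploading when none is parked.
    for path, key in sorted(new.items()):
        if old.get(path) == key:
            continue
        if key in parked:
            ops.append(("rename", temp_name(key), path))
            parked.remove(key)
        else:
            ops.append(("upload", key, path))

    # Phase 3: delete temp names of keys that are no longer exported.
    for key in sorted(parked):
        ops.append(("delete", temp_name(key), ""))
    return ops


def apply_ops(remote: Tree, ops: List[Op]) -> Tree:
    """Simulate the planned operations against an in-memory remote."""
    state = dict(remote)
    for action, src, dst in ops:
        if action == "rename":
            state[dst] = state.pop(src)
        elif action == "delete":
            del state[src]
        else:  # "upload": src holds the key to send
            state[dst] = src
    return state
```

For the A to B, B to C, C to A case, `plan_export({"A": "k1", "B": "k2", "C": "k3"}, {"B": "k1", "C": "k2", "A": "k3"})` yields only renames, and `apply_ops` on the old tree reproduces the new one. The sketch does not cover resuming an interrupted export, which per the notes above needs the goal tree recorded in export.log.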