thoughts on handling renames efficiently
This gets complicated, but I think this design will work! This commit was supported by the NSF-funded DataLad project.
parent a1cc9ec0fd
commit 1ec3a9eb05
2 changed files with 39 additions and 9 deletions
@@ -237,11 +237,37 @@ for the current treeish. (Unless a conflicting export was made from
elsewhere, but in that case, the conflict resolution will have to fix up
later.)

Efficient resuming can then first check if the location log says the
export contains the content. (If not, transfer a copy.) If the location
log says the export contains the content, use CHECKPRESENTEXPORT to see if
the file exists, and if not transfer a copy. The CHECKPRESENTEXPORT check
deals with the case where the treeish has two files with the same content.
If we have a key-to-files map for the export, then we can skip the
CHECKPRESENTEXPORT check when there's only one file using a key. So,
resuming can be quite efficient.
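
The resume check described above can be sketched roughly as follows. This is only an illustration in Python, not git-annex code: `needs_transfer`, `location_log_has`, `check_present_export` and `key_to_files` are hypothetical names standing in for the location log lookup, the CHECKPRESENTEXPORT request, and the key-to-files map.

```python
from typing import Callable, Dict, List


def needs_transfer(
    key: str,
    filename: str,
    location_log_has: Callable[[str], bool],
    check_present_export: Callable[[str, str], bool],
    key_to_files: Dict[str, List[str]],
) -> bool:
    """Return True if `filename` still has to be sent to the export.

    All callables are stand-ins (assumptions) for the real lookups.
    """
    # The location log does not say the export has this content: transfer.
    if not location_log_has(key):
        return True
    # Only one file uses this key, so the location log entry can only be
    # about this file and the remote check can be skipped.
    if len(key_to_files.get(key, [])) == 1:
        return False
    # Several files share the key: ask the remote about this specific file.
    return not check_present_export(key, filename)
```

With this shape, the remote is only contacted for keys that are shared by more than one file in the treeish.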

## handling renames efficiently

To handle two files that swap names, a temp name is required.

The difficulty with a temp name is picking a name that won't ever be used
by any exported file.

Interrupted exports also complicate this. While a name could be picked that
is in neither the old nor the new tree, an export could be interrupted,
leaving the file at the temp name. There needs to be something to clean
that up when the export is resumed, even if it's resumed with a different
tree.

Could use something like ".git-annex-tmp-content-$key" as the temp name.
This hides it from casual view, which is good, and it's not dependent on the
tree, so no state needs to be maintained to clean it up. Also, using the
key in the name simplifies calculation of complicated renames (eg, renaming
A to B, B to C, C to A).

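
To see why a key-based temp name makes a rename cycle easy to compute, here is a small Python sketch. It models the remote as a plain dict from filename to key, which is an assumption made purely for illustration, and it further assumes each key is used by only one file. `temp_name` and `rename_via_temp` are hypothetical helpers; only the ".git-annex-tmp-content-" prefix comes from the text above.

```python
from typing import Dict

TEMP_PREFIX = ".git-annex-tmp-content-"  # prefix suggested in the text


def temp_name(key: str) -> str:
    """Temp name that depends only on the key, not on any tree."""
    return TEMP_PREFIX + key


def rename_via_temp(
    remote: Dict[str, str],
    old_tree: Dict[str, str],
    new_tree: Dict[str, str],
) -> None:
    """Rearrange `remote` (filename -> key) from old_tree to new_tree.

    Sketch only: covers renamed files, assumes one file per key, and
    does not handle files that are purely added or removed.
    """
    wanted_keys = set(new_tree.values())
    # Pass 1: park every file whose name is changing at its key's temp name.
    for name, key in old_tree.items():
        if new_tree.get(name) != key and key in wanted_keys:
            remote[temp_name(key)] = remote.pop(name)
    # Pass 2: move each parked file to the name the new tree gives it.
    for name, key in new_tree.items():
        if old_tree.get(name) != key and temp_name(key) in remote:
            remote[name] = remote.pop(temp_name(key))
```

Because every file is parked under a name derived from its own key, the cycle A to B, B to C, C to A needs no ordering analysis: nothing is overwritten in the first pass, and the second pass only reads temp names.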

Export can first try to rename the temp name of each key whose file is
added in the diff to that file's name, and then delete the temp name of
each key whose file is removed in the diff. That is more renames and
deletes than strictly necessary, but it will statelessly clean up
an interrupted export as long as it's run again with the same new tree.

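
A rough Python sketch of that stateless cleanup pass, again modelling the remote as a dict from filename to key purely for illustration. `cleanup_temp_names` and its arguments are hypothetical; on a real remote the rename and delete would be attempted and allowed to fail when no temp file exists, which the membership checks stand in for here.

```python
from typing import Dict, Iterable

TEMP_PREFIX = ".git-annex-tmp-content-"  # prefix suggested in the text


def temp_name(key: str) -> str:
    return TEMP_PREFIX + key


def cleanup_temp_names(
    remote: Dict[str, str],
    added: Dict[str, str],
    removed_keys: Iterable[str],
) -> None:
    """Clean up temp files left behind by an interrupted export.

    `added` maps filename -> key for files added in the diff;
    `removed_keys` are the keys of files removed in the diff.
    """
    # Try to move each added key's temp file to its final name.
    for name, key in added.items():
        tmp = temp_name(key)
        if tmp in remote:  # nothing to do if no temp file was left behind
            remote[name] = remote.pop(tmp)
    # Then drop the temp file of every removed key, if one is still there.
    for key in removed_keys:
        remote.pop(temp_name(key), None)
```

Running the pass a second time changes nothing, which is what lets it be repeated safely on every resume.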

But, an export of tree B should clean up after
an interrupted export of tree A. Some state is needed to handle this.
Before starting the export of tree A, record it somewhere. Then, when
resuming, diff A..B and rename/delete the temp names of the keys in the
diff, and also diff from the last fully exported tree to B and do
the same rename/delete.

So, before an export does anything, it needs to record the tree that's
about to be exported in export.log, not as an exported tree, but as a goal.
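
One way to picture the goal record is the Python sketch below. The `ExportLog` class and its two fields are assumptions made for illustration and say nothing about the actual export.log format; `begin_export` and `finish_export` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExportLog:
    """Hypothetical in-memory view of the export state."""
    exported: Optional[str] = None  # last fully exported tree
    goal: Optional[str] = None      # tree an export was working toward


def begin_export(log: ExportLog, tree: str) -> List[str]:
    """Record `tree` as the goal before doing anything else.

    Returns the trees that should each be diffed against `tree` so the
    temp names of keys in those diffs can be renamed/deleted: an
    unfinished goal first, then the last fully exported tree.
    """
    candidates = [t for t in (log.goal, log.exported) if t and t != tree]
    pending = list(dict.fromkeys(candidates))  # drop duplicates, keep order
    log.goal = tree
    return pending


def finish_export(log: ExportLog) -> None:
    """Promote the goal to the exported tree once the export completes."""
    log.exported = log.goal
    log.goal = None
```

For example, if an export of tree A on top of an exported tree T0 is interrupted and tree B is exported next, `begin_export(log, "B")` returns `["A", "T0"]`, the two diffs the text above calls for.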

@@ -19,7 +19,11 @@ Work is in progress. Todo list:

* `git annex get --from export` works in the repo that exported to it,
  but in another repo, the export db won't be populated, so it won't work.
  Maybe just show a useful error message in this case?
  However, exporting from one repository and then trying to update the
  export from another repository also doesn't work right, because the
  export database is not populated. So it seems that the export database
  needs to be populated from the export log in these cases.
* Efficient handling of renames.
* Support export to additional special remotes (S3 etc)
* Support export to external special remotes.