more thoughts

2018-06-14 13:30:34 -04:00 · 2018-06-14 13:30:34 -04:00 · 466d3fbaab
commit 466d3fbaab
parent cc4b3b9c06
1 changed files with 62 additions and 9 deletions
--- a/doc/todo/import_tree.mdwn
+++ b/doc/todo/import_tree.mdwn
@@ -23,17 +23,70 @@ files to the remote with their mtime set to the date of the treeish
 being exported (when the treeish is a commit, which has dates, and not
 a raw tree). Then the remote can simply enumerate all files,
 with their mtimes, and look for files that have mtimes
-newer than the last exported treeish's date, as well as noticing
+newer than the last exported treeish's date.
 deleted and newly added/renamed files.
-> Hmm, but if files on the remote are being changed at the same time
+> But: If files on the remote are being changed at around the time
-> as the export, then they could have older mtimes, and be missed.
+> of the export, they could have older mtimes than the exported treeish's
-> --[[Joey]]
+> date, and so be missed.
 > 
 > Also, a rename that swaps two files would be missed if mtimes
 > are only compared to the treeish's date.
-A similar approach is for the remote to preserve object file timestamps,
+A perhaps better way is for the remote to keep track of the mtime,
-but keep a list somewhere (eg a file on the remote) of the timestamps of
+size, etc of all exported files, and use that state to find changes.
-each exported file, and then it can later look for files with newer
+Where to store that data?
-timestamps.
+
 The data could be stored in a file/files on the remote, or perhaps
 the remote has a way to store some arbitrary metadata about a file
 that could be used. Note that's basically the same as implementing the git
 index, on a per-remote basis.
 It could be stored in git-annex branch per-remote state. However,
 that state is per-key, not per-file. The export database could be
 used to convert an ExportLocation to a Key, which could be used
 to access the per-remote state. Querying the database for each file
 in the export could be a bottleneck without the right interface.
 If only one repository will ever access the remote, it could be stored
 in eg a local database. But access from only one repository is a 
 hard invariant to guarantee.
 Would local storage pose a problem when multiple repositories import from
 the same remote? In that case, perhaps different trees would be imported,
 and merged into master. So the two repositories then have differing
 masters, which can be reconciled as usual. It would mean extra downloads
 of content from the remote, since each import would download its own copy.
 Perhaps this is acceptable?
 ----
 Following the thoughts above, how about this design: The remote
 is responsible for collecting a list of files currently in it, along with
 some content identifier. That data is sent to git-annex. git-annex stores
 the content identifiers locally, and compares old and new lists to determine
 when a file on the remote has changed or is new.
 This way, each special remote doesn't have to reimplement the equivalent of
 the git index, or comparing lists of files; it only needs a way to list
 files, and a good content identifier.
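The comparison that git-annex would do on its side could look something like the sketch below. This is only an illustration in Python, not git-annex code: `diff_listings`, `Changes`, and the choice of a plain dict mapping each file path to its content identifier are all assumptions made for the example.

```python
from typing import Dict, List, NamedTuple


class Changes(NamedTuple):
    """Result of comparing two listings of a remote (hypothetical type)."""
    added: List[str]
    modified: List[str]
    deleted: List[str]


def diff_listings(old: Dict[str, str], new: Dict[str, str]) -> Changes:
    """Compare the stored listing with a fresh one from the remote.

    Both arguments map a file path to its content identifier.
    A path only in `new` is added, a path only in `old` is deleted,
    and a path in both whose identifier differs is modified.
    """
    added = sorted(p for p in new if p not in old)
    deleted = sorted(p for p in old if p not in new)
    modified = sorted(p for p in new if p in old and old[p] != new[p])
    return Changes(added, modified, deleted)
```

With `old = {"a": "id1", "b": "id2"}` and `new = {"a": "id2", "b": "id1", "copy": "id1"}`, both `a` and `b` come back as modified, so a rename that swaps two files is noticed, and `copy` comes back as added even though its identifier has been seen before.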
 A good content identifier needs to:
 * Be stable, so when a file has not changed, the content identifier
  remains the same.
 * Change when a file is modified.
 * Be reasonably unique, but not necessarily fully unique.  
  For example, if the mtime of a file is used as the content identifier, then
  a rename that swaps two files would be noticed, except in the
  unusual case where they have the same mtime. If a new file (or a copy)
  is added with the same mtime as some other file in the tree though,
  git-annex will see that the file is new, and so can still import it, even
  though it's seen that content identifier before. Of course, that might
  result in unnecessary downloads, so a more unique content identifier would
  be better.
 A (size, mtime, inode) tuple is as good a content identifier as git uses in
 its index. That or a hash of the content would be ideal.
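For a remote that is just a local directory, both kinds of identifier could be computed roughly as follows. This is a sketch under assumptions: the function names are made up for illustration, SHA-256 is picked arbitrarily as the hash, and other kinds of remotes would need whatever equivalent metadata their API exposes.

```python
import hashlib
import os
from typing import Tuple


def stat_content_identifier(path: str) -> Tuple[int, int, int]:
    """(size, mtime, inode) of a file, similar to what git's index tracks.

    Cheap to compute, since the file content is never read.
    """
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns, st.st_ino)


def hash_content_identifier(path: str, chunk_size: int = 65536) -> str:
    """Hex SHA-256 of the file content, read in chunks.

    Changes exactly when the content changes, but needs a full read.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

The stat tuple stays the same as long as the file is left alone, and changes when the file is rewritten with a different size or mtime. The hash costs a read of the whole file, which may or may not be cheap depending on the remote.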
 ----