From aaeadc422acbe7ca1055a8af525ed463246cac45 Mon Sep 17 00:00:00 2001
From: Joey Hess
Date: Tue, 24 Oct 2023 13:54:31 -0400
Subject: [PATCH] comment

---
 ..._c232e1e1cfcc47f70079f2d32c2b4633._comment |  4 +--
 ..._e81719f23565579674249db5d0a883da._comment | 27 +++++++++++++++++++
 2 files changed, 29 insertions(+), 2 deletions(-)
 create mode 100644 doc/todo/Incremental_git_annex_sync_--content_--all/comment_5_e81719f23565579674249db5d0a883da._comment

diff --git a/doc/todo/Incremental_git_annex_sync_--content_--all/comment_4_c232e1e1cfcc47f70079f2d32c2b4633._comment b/doc/todo/Incremental_git_annex_sync_--content_--all/comment_4_c232e1e1cfcc47f70079f2d32c2b4633._comment
index 78f0c942a9..bd167f075e 100644
--- a/doc/todo/Incremental_git_annex_sync_--content_--all/comment_4_c232e1e1cfcc47f70079f2d32c2b4633._comment
+++ b/doc/todo/Incremental_git_annex_sync_--content_--all/comment_4_c232e1e1cfcc47f70079f2d32c2b4633._comment
@@ -6,6 +6,6 @@
 My recent optimisations of `git-annex sync` with importtree remotes uses a
 similar diffing approach.
 
-A transition is underway to making `--content` be enabled by default, and
-faster syncing with it would be a nice thing to do before then.
+`git-annex satisfy` syncs `--content` by default, so this optimisation would
+be especially nice to have for it.
 """]]
diff --git a/doc/todo/Incremental_git_annex_sync_--content_--all/comment_5_e81719f23565579674249db5d0a883da._comment b/doc/todo/Incremental_git_annex_sync_--content_--all/comment_5_e81719f23565579674249db5d0a883da._comment
new file mode 100644
index 0000000000..d68d10d4a7
--- /dev/null
+++ b/doc/todo/Incremental_git_annex_sync_--content_--all/comment_5_e81719f23565579674249db5d0a883da._comment
@@ -0,0 +1,27 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 5"""
+ date="2023-10-24T17:26:53Z"
+ content="""
+To implement this optimisation for a non-all sync, when
+the tree being synced has changed, it ought to diff from the old
+tree to the current tree, and sync those files. Preferred
+content can vary depending on filename, and diffing like that will avoid
+scanning every file in the whole tree.
+
+And when there are location log changes, it needs to also sync files in the
+tree that use keys whose location log changed, using the git-annex branch
+diff to find those keys. (And presumably then using the keys database to get
+back to the filenames.)
+
+So, implementing an optimisation like this for a non-all sync has two
+separate diffs which would have to be combined together somehow.
+
+Doing that in constant memory would be hard. It seems that a bloom filter
+cannot be used to check if a file was processed in the first diff and avoid
+processing it again in the second diff. Because a false positive would
+avoid processing a file whose location log did change. I think it would
+need to use an on-disk structure maybe (eg sqlite)?
+
+None of which should prevent implementing this nice optimisation for --all.
+"""]]