Joey Hess 2020-06-30 16:41:31 -04:00
parent 5891e10409
commit 8f508d4406
2 changed files with 56 additions and 0 deletions

@@ -0,0 +1,24 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2020-06-30T16:27:26Z"
content="""
I am surprised by 30m, unless you have a very large number of files or
keys. Benchmarking in a repo with 100k files and keys,
`git-annex sync --content` took 7m, and with `--all` it took 14m
(both with a cold cache).

How many files and keys does `git-annex info` say are in your repository?

`--all` is slower because it does two passes: first over all files in the
current branch, and then a second pass over all keys. (That is necessary to
handle preferred content filename matching correctly.)
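
As a sketch of that structure (hypothetical names, not git-annex's actual
code): pass 1 has file names available for preferred content matching,
pass 2 does not.

	type Key = String

	-- Why --all needs two passes over the repository's keys.
	syncContentAll
	    :: (Maybe FilePath -> Key -> IO ()) -- process a key, file name if known
	    -> [(FilePath, Key)]                -- files in the current branch
	    -> [Key]                            -- every key in the repository
	    -> IO ()
	syncContentAll process branchFiles allKeys = do
	    mapM_ (\(f, k) -> process (Just f) k) branchFiles -- pass 1: names known
	    mapM_ (process Nothing) allKeys                   -- pass 2: bare keys
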
`git cat-file --batch-all-objects --unordered` would not work, because
there would be no way to know whether a particular blob is the current
version of the location log for a key, or an old version.
(For that matter, it doesn't even say what file a blob belongs to, so
there would be no way to tell which key a blob is the location log for.)
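
Getting that association back needs a walk of the git-annex branch's tree.
A sketch of the kind of thing required (again not git-annex's actual code),
assuming `git ls-tree -r git-annex` output of the form
`<mode> <type> <sha>\t<path>`:

	import System.Process (readProcess)

	-- Map each blob in the git-annex branch to the path of the location
	-- log it is the current version of. --batch-all-objects cannot
	-- provide this association.
	currentLocationLogBlobs :: IO [(String, FilePath)] -- (blob sha, log path)
	currentLocationLogBlobs = do
	    out <- readProcess "git" ["ls-tree", "-r", "git-annex"] ""
	    return [ (sha, path)
	           | l <- lines out
	           , let (meta, '\t' : path) = break (== '\t') l
	           , let [_mode, _type, sha] = words meta
	           ]
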
[[design/caching_database]] talks about speedups from sqlite caching. But
that's a hard road.
"""]]

@@ -0,0 +1,32 @@
[[!comment format=mdwn
username="joey"
subject="""comment 2"""
date="2020-06-30T18:53:52Z"
content="""
I wonder if the second pass could avoid looking at the location log at all
for keys already handled in the first pass. Currently it only consults the
bloom filter to skip dropping those keys, not to skip transferring them. If
it also skipped transferring, it would not need to read the location log a
second time for the same key.

That would speed it up by around 2x in fairly typical cases, where there
are a lot of files but not a lot of old versions of each file.

The problem with that is that it's a bloom filter, so there can be false
positives. Currently a false positive means it does not drop a key that it
otherwise would want to; annoying maybe, but unlikely to happen and not a
big problem. But consulting the bloom filter for transfers would risk not
making as many copies of a key as it is configured to, which risks data
loss, or at least not having all desired data available after a sync.
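
To sketch the asymmetry (with `seenInPass1` standing in for a bloom filter
query, which can return false positives but never false negatives):

	type Key = String

	considerDrop :: (Key -> Bool) -> Key -> IO () -> IO ()
	considerDrop seenInPass1 key dropAction
	    | seenInPass1 key = return () -- false positive: a wanted drop is
	                                  -- skipped; annoying but safe
	    | otherwise = dropAction

	considerTransfer :: (Key -> Bool) -> Key -> IO () -> IO ()
	considerTransfer seenInPass1 key transferAction
	    | seenInPass1 key = return () -- false positive: a needed copy is
	                                  -- never made; the data loss risk
	    | otherwise = transferAction
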
But, if it could use something other than a bloom filter to keep track
of the keys processed in the first pass, that would be a good optimisation.
A sqlite database maybe, though the overhead of querying it would have to
be considered. Or just keep the keys in memory without a bloom filter, and
only fall back to the bloom filter if there are too many keys (sketched
below).

The bloom filter currently used takes around 16 mb of ram. A typical key
is 80 bytes or so, so up to around 200,000 keys in a set (200,000 × 80
bytes ≈ 16 mb) is probably the same ballpark amount of ram. (32 mb would
probably be needed while constructing the bloom filter from the set, since
both would be in memory at once.)
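
One possible shape for that fallback (a sketch only, not a design
commitment): hold the pass-1 keys exactly in a Set until a cutoff like the
200,000 above, and only then degrade to a bloom filter, whose operations
are passed in here since any implementation would do.

	import qualified Data.Set as S

	type Key = String

	-- Exact membership until the cutoff, approximate afterwards.
	data Seen b = Exact (S.Set Key) | Approx b

	addSeen
	    :: Int              -- cutoff: max keys to hold exactly (eg 200000)
	    -> (S.Set Key -> b) -- build a bloom filter from the whole set
	    -> (Key -> b -> b)  -- insert one key into the bloom filter
	    -> Key -> Seen b -> Seen b
	addSeen cutoff fromSet insertB k (Exact s)
	    | S.size s < cutoff = Exact (S.insert k s)
	    | otherwise         = Approx (insertB k (fromSet s))
	addSeen _ _ insertB k (Approx b) = Approx (insertB k b)

	memberSeen :: (Key -> b -> Bool) -> Key -> Seen b -> Bool
	memberSeen _ k (Exact s)     = S.member k s
	memberSeen memB k (Approx b) = memB k b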
"""]]