comments

parent 5891e10409
commit 8f508d4406

2 changed files with 56 additions and 0 deletions

@@ -0,0 +1,24 @@
[[!comment format=mdwn
 username="joey"
 subject="""comment 1"""
 date="2020-06-30T16:27:26Z"
 content="""
I am surprised by 30m, unless you have a very large number of files
or keys. Benchmarking in a repo with 100k files and keys, git-annex
sync --content took 7m, and with --all 14m. (both cold cache)

How many files and keys does git-annex info say are in your repository?

--all is slower because it does two passes, first over all files in the
current branch, and a second pass over all keys. (Necessary to handle
preferred content filename matching correctly.)
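
To illustrate the shape of that, here is a rough sketch with invented names
(none of these functions are git-annex's actual code):

    -- Illustrative sketch only: made-up names, not git-annex's code.
    newtype Key = Key String deriving (Eq, Ord, Show)

    syncContentAll
        :: [(FilePath, Key)]                -- files in the current branch, with their keys
        -> [Key]                            -- every key known to the git-annex branch
        -> (Maybe FilePath -> Key -> IO ()) -- apply preferred content, transfer/drop one key
        -> IO ()
    syncContentAll branchFiles allKeys act = do
        -- Pass 1: files in the current branch. Preferred content expressions
        -- can match on the filename, so the filename is passed along.
        mapM_ (\(f, k) -> act (Just f) k) branchFiles
        -- Pass 2: every key, including keys only referenced by old versions
        -- of files. No filename is available here.
        mapM_ (act Nothing) allKeys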

git cat-file --batch-all-objects --unordered would not work, because
there would be no way to know if a particular blob was a current version of
the location log for a key, or an old version.
(For that matter, it doesn't even say what file the blob belongs to, so
it would have no idea what it's a location log for.)

[[design/caching_database]] talks about speedups by sqlite caching. But
that's a hard road.
"""]]

@@ -0,0 +1,32 @@
[[!comment format=mdwn
 username="joey"
 subject="""comment 2"""
 date="2020-06-30T18:53:52Z"
 content="""
I wonder if the second pass could avoid looking at the location log at all
for keys handled in the first pass. Currently it only checks the bloom
filter to skip dropping those keys, but not to skip transferring the keys. If it
also skipped transferring, it would not need to read the location log a second
time for the same key.
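
A sketch of that control flow, with invented names (`handledInPass1` stands
for whatever membership test the first pass leaves behind; this is not
git-annex's actual code):

    newtype Key = Key String deriving (Eq, Ord, Show)

    secondPass
        :: (Key -> Bool)   -- was this key already transferred/dropped in pass 1?
        -> (Key -> IO ())  -- read the location log and transfer/drop as needed
        -> [Key]           -- every key enumerated by the second pass
        -> IO ()
    secondPass handledInPass1 processKey = mapM_ $ \k ->
        if handledInPass1 k
            then return ()    -- skip the key entirely: no drop, no transfer,
                              -- and no second read of its location log
            else processKey k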

That would speed it up by around 2x, in fairly typical cases where there are a
lot of files, but not a lot of old versions of each file.

Problem with that is, it's a bloom filter, there can be false positives.
Currently a false positive means it does not drop a key that it should want
to, annoying maybe but unlikely to happen and not a big problem. But
consulting the bloom filter for transfers would risk it not making as many
copies of a key as it is configured to, which risks data loss, or at least not
having all desired data available after sync.

But, if it could use something other than a bloom filter to keep track
of the keys processed in the first pass, that would be a good optimisation.
Sqlite database maybe, have to consider the overhead of querying it. Just
keeping the keys in memory w/o a bloom filter maybe, and only use the bloom
filter if there are too many keys.
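
One possible shape for that, sketched with a placeholder bloom filter type
(nothing here is git-annex's actual code, and the threshold is only a guess):

    import qualified Data.Set as S

    newtype Key = Key String deriving (Eq, Ord, Show)

    -- `bloom` is a stand-in for whatever bloom filter type is actually used;
    -- its insert/member/build operations are passed in rather than named here.
    data SeenKeys bloom
        = Exact (S.Set Key)  -- exact membership: safe for skipping transfers too
        | Overflowed bloom   -- too many keys: bloom filter, only safe for skipping drops

    exactLimit :: Int
    exactLimit = 200000      -- placeholder; see the ram estimate below

    record :: (bloom -> Key -> bloom) -> (S.Set Key -> bloom)
           -> Key -> SeenKeys bloom -> SeenKeys bloom
    record bloomInsert fromSet k (Exact s)
        | S.size s < exactLimit = Exact (S.insert k s)
        | otherwise             = Overflowed (bloomInsert (fromSet s) k)
    record bloomInsert _ k (Overflowed b) = Overflowed (bloomInsert b k)

    member :: (bloom -> Key -> Bool) -> Key -> SeenKeys bloom -> Bool
    member _ k (Exact s)                = k `S.member` s
    member bloomMember k (Overflowed b) = bloomMember b k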

The bloom filter used currently uses around 16 mb of ram. A typical key is
80 bytes or so. So, up to around 200,000 keys in a set is probably the same
ballpark amount of ram. (32 mb would be needed to construct the
bloom filter from the set, probably.)
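
Spelling out that ballpark arithmetic (estimates only, not measurements):

    bloomRamBytes, bytesPerKey, keysInSameRam :: Int
    bloomRamBytes = 16 * 1000 * 1000                 -- ~16 mb used by the bloom filter now
    bytesPerKey   = 80                               -- rough size of a typical key
    keysInSameRam = bloomRamBytes `div` bytesPerKey  -- = 200000 keys in the same ballpark of ram

    -- Building the bloom filter from such a set needs both live at once:
    -- ~16 mb (the set) + ~16 mb (the filter) ≈ 32 mb peak.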
"""]]