From 8f508d4406454b0ed14620764d70d745e58380d8 Mon Sep 17 00:00:00 2001
From: Joey Hess
Date: Tue, 30 Jun 2020 16:41:31 -0400
Subject: [PATCH] comments

---
 ..._dc1850b3ec3e4775a52f4cd9e311c5a9._comment | 24 ++++++++++++++
 ..._5ab6db1e8b5f3131cfc61626bab7b8d9._comment | 32 +++++++++++++++++++
 2 files changed, 56 insertions(+)
 create mode 100644 doc/todo/speed_up_git_annex_sync_--content_--all/comment_1_dc1850b3ec3e4775a52f4cd9e311c5a9._comment
 create mode 100644 doc/todo/speed_up_git_annex_sync_--content_--all/comment_2_5ab6db1e8b5f3131cfc61626bab7b8d9._comment

diff --git a/doc/todo/speed_up_git_annex_sync_--content_--all/comment_1_dc1850b3ec3e4775a52f4cd9e311c5a9._comment b/doc/todo/speed_up_git_annex_sync_--content_--all/comment_1_dc1850b3ec3e4775a52f4cd9e311c5a9._comment
new file mode 100644
index 0000000000..bcb41f9bd0
--- /dev/null
+++ b/doc/todo/speed_up_git_annex_sync_--content_--all/comment_1_dc1850b3ec3e4775a52f4cd9e311c5a9._comment
@@ -0,0 +1,24 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 1"""
+ date="2020-06-30T16:27:26Z"
+ content="""
+I am surprised by 30m, unless you have a very large number of files
+or keys. Benchmarking in a repo with 100k files and keys, git-annex
+sync --content took 7m, and with --all 14m. (both with a cold cache)
+
+How many files and keys does git-annex info say are in your repository?
+
+--all is slower because it does two passes: first over all files in the
+current branch, and then a second pass over all keys. (That is necessary
+to handle preferred content filename matching correctly.)
+
+git cat-file --batch-all-objects --unordered would not work, because
+there would be no way to know whether a particular blob is the current
+version of the location log for a key, or an old version.
+(For that matter, it doesn't even say which file the blob belongs to, so
+it would have no idea what it's a location log for.)
+
+[[design/caching_database]] talks about speedups from sqlite caching. But
+that's a hard road.
+"""]]
diff --git a/doc/todo/speed_up_git_annex_sync_--content_--all/comment_2_5ab6db1e8b5f3131cfc61626bab7b8d9._comment b/doc/todo/speed_up_git_annex_sync_--content_--all/comment_2_5ab6db1e8b5f3131cfc61626bab7b8d9._comment
new file mode 100644
index 0000000000..44026911ee
--- /dev/null
+++ b/doc/todo/speed_up_git_annex_sync_--content_--all/comment_2_5ab6db1e8b5f3131cfc61626bab7b8d9._comment
@@ -0,0 +1,32 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 2"""
+ date="2020-06-30T18:53:52Z"
+ content="""
+I wonder if the second pass could avoid looking at the location log at all
+for keys handled in the first pass. Currently it only checks the bloom
+filter to skip dropping those keys, but not to skip transferring them. If
+it also skipped transferring, it would not need to read the location log a
+second time for the same key.
+
+That would speed it up by around 2x in fairly typical cases, where there
+are a lot of files but not a lot of old versions of each file.
+
+The problem with that is that it's a bloom filter, so there can be false
+positives. Currently a false positive means it does not drop a key that it
+would otherwise want to drop; annoying maybe, but unlikely to happen and
+not a big problem. But consulting the bloom filter for transfers would risk
+not making as many copies of a key as it is configured to make, which risks
+data loss, or at least not having all desired data available after a sync.
+
+But if it could use something other than a bloom filter to keep track
+of the keys processed in the first pass, that would be a good optimisation.
+A sqlite database maybe, though the overhead of querying it would have to
+be considered. Or just keep the keys in memory in a plain set, without a
+bloom filter, and only fall back to the bloom filter if there are too many
+keys.
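+
+To make that concrete, here is a minimal, self-contained Haskell sketch of
+the set-with-bloom-fallback idea. It assumes plain String keys and a toy
+single-hash FNV-1a bloom filter; git-annex's real Key type and bloom filter
+differ, and Seen, insertSeen, and memberSeen are hypothetical names.
+
+    import qualified Data.Set as Set
+    import Data.Bits (setBit, testBit, xor)
+    import Data.Char (ord)
+    import Data.List (foldl')
+
+    -- Track first-pass keys exactly until a cutoff, then degrade to a
+    -- bloom filter; membership is only approximate after the switch.
+    data Seen = Exact (Set.Set String) | Bloomed Integer
+
+    cutoff :: Int
+    cutoff = 200000  -- roughly 16 mb of 80-byte keys, per the estimate below
+
+    bloomBits :: Int
+    bloomBits = 1000003  -- toy size; a real filter is sized for expected load
+
+    -- FNV-1a hash of a key, reduced to a bit position in the filter.
+    bitFor :: String -> Int
+    bitFor = fromIntegral . (`mod` fromIntegral bloomBits)
+               . foldl' step 2166136261
+      where
+        step :: Integer -> Char -> Integer
+        step h c = ((h `xor` fromIntegral (ord c)) * 16777619) `mod` (2 ^ 64)
+
+    -- Record a key from the first pass, switching representations once
+    -- the exact set reaches the cutoff.
+    insertSeen :: String -> Seen -> Seen
+    insertSeen k (Exact s)
+      | Set.size s < cutoff = Exact (Set.insert k s)
+      | otherwise = Bloomed (foldl' addBit 0 (k : Set.toList s))
+      where
+        addBit b k' = setBit b (bitFor k')
+    insertSeen k (Bloomed b) = Bloomed (setBit b (bitFor k))
+
+    -- While Exact, membership is precise, so the second pass could safely
+    -- skip transfers too; once Bloomed, false positives are possible and
+    -- only drop-skipping remains safe.
+    memberSeen :: String -> Seen -> Bool
+    memberSeen k (Exact s) = Set.member k s
+    memberSeen k (Bloomed b) = testBit b (bitFor k)
+
+    main :: IO ()
+    main = do
+        let seen = foldl' (flip insertSeen) (Exact Set.empty) ["key1", "key2"]
+        print (memberSeen "key1" seen, memberSeen "key3" seen)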
+
+The bloom filter currently used takes around 16 mb of ram. A typical key
+is 80 bytes or so, so a set of up to around 200,000 keys (200,000 * 80
+bytes is roughly 16 mb) is probably in the same ballpark amount of ram.
+(Around 32 mb would probably be needed while constructing the bloom filter
+from the set, since both would be in memory at once.)
+"""]]