bloom doesn't work, but this should I hope

Joey Hess 2021-06-14 17:53:01 -04:00
parent 2df4c1cf91
commit 6099edbf1c


@@ -10,19 +10,23 @@ returning a long list of files. So it could detect say 10 files in the list
 and start doing something other than the usual, without bothering the usual
 case with any extra work.
 
-A bloom filter could be used to keep track of keys that have already had
-their associated files populated, and be used to skip the work the next
-time that same key is added. In the false positive case, it would check the
-associated files as it does now, so no harm done.
+Git starts to get slow anyway in the 1 million to 10 million file range. So
+we can assume less than that many files are being added. And there need to
+be a fairly large number of duplicates of a key for speed to become a problem
+when adding that key. Around 1000 based on above benchmarks, but 100 would
+be safer.
 
-Putting these together, a bloom filter with a large enough capacity could
-be set up when it detects the problem, and used to skip the redundant work.
-This would change the checking overhead from `O(N^2)` to `O(N^F)` where F is
-the false positive rate of the bloom filter. And the false positive rate of
-the usual git-annex bloom filter is small: 1/1000000 when half a million
-files are in it. Since 1-10 million files is where git gets too slow to be
-usable, the false positive rate should remain low up until the point other
-performance becomes a problem.
+If it's adding 10 million files, there can be at most 10000 keys
+that have `>=` 1000 duplicates (10 million / 1000).
+No problem to remember 10000 keys; a key is less than 128 bytes long, so
+that would take 1250 kb, plus the overhead of the Map. Might as well
+remember 12 mb worth of keys, to catch 100 duplicates.
+
+It would be even better to use a bloom filter, which could remember many
+more, and I thought I had a way, but the false positive case seems the
+wrong way around. If the bloom filter remembers keys that have already had
+their associated files populated, then a false positive would prevent doing
+that for a key that it's not been done for.
 
 It would make sense to do this not only in populateUnlockedFiles but in
 Annex.Content.moveAnnex and Annex.Content.removeAnnex. Although removeAnnex
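
To make the plan in the added text concrete, here is a minimal, standalone sketch of the Map/Set direction it describes. This is not git-annex's actual code: `Key` as a plain string, `populateAssociatedFiles`, and the size constant are illustrative assumptions. It also shows why this direction is safe where the bloom filter is not: forgetting a key (or never having seen it) only costs a redundant scan, whereas a bloom filter false positive would claim a key was already handled and skip populating files that still need it.

```haskell
import qualified Data.Set as S
import Data.IORef

type Key = String

-- Rough budget from the text above: on the order of 100000 keys at
-- <= 128 bytes each, so roughly 12 mb plus Set overhead.
-- (Illustrative constant, not a git-annex setting.)
maxRememberedKeys :: Int
maxRememberedKeys = 100000

-- Remember keys whose associated files were already populated, and skip
-- the repeated work when the same key is added again. If the set grows
-- past the budget, start over empty: the worst case is redundant work,
-- never skipped work (the failure mode a bloom filter false positive
-- would have).
populateOnce :: IORef (S.Set Key) -> (Key -> IO ()) -> Key -> IO ()
populateOnce seenRef populateAssociatedFiles key = do
    seen <- readIORef seenRef
    if key `S.member` seen
        then return ()          -- duplicate of an already-handled key
        else do
            populateAssociatedFiles key
            let seen' = S.insert key seen
            writeIORef seenRef $ if S.size seen' > maxRememberedKeys
                                    then S.singleton key
                                    else seen'

main :: IO ()
main = do
    seenRef <- newIORef S.empty
    -- Simulate adding many duplicates of one key, plus one other key.
    let keys = replicate 5 "SHA256E-s100--aaaa" ++ ["SHA256E-s200--bbbb"]
    mapM_ (populateOnce seenRef
              (\k -> putStrLn ("populating files for " ++ k)))
          keys
```

Capping the set at roughly the 12 mb budget described above and simply starting over when it fills keeps memory bounded while still catching the pathological case of a key with hundreds or thousands of duplicates.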