plan

2021-06-14 17:13:37 -04:00 · 2021-06-14 17:13:37 -04:00 · 2df4c1cf91
commit 2df4c1cf91
parent 0e3802c7ee
1 changed files with 31 additions and 0 deletions
--- a/doc/bugs/significant_performance_regression_impacting_datal/comment_32_0ab5e774dad1e6e74a5c72b95e40d659._comment
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_32_0ab5e774dad1e6e74a5c72b95e40d659._comment
@@ -0,0 +1,31 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 32"""
+ date="2021-06-14T20:26:52Z"
+ content="""
+Some thoughts leading to a workable plan:
+
+It's easy to detect this edge case because getAssociatedFiles will be
+returning a long list of files. So it could detect, say, 10 or more files
+in the list and switch to something other than the usual behavior, without
+burdening the usual case with any extra work.
+
+A bloom filter could be used to keep track of keys that have already had
+their associated files populated, and be used to skip the work the next
+time that same key is added. In the false positive case, it would check the
+associated files as it does now, so no harm done.
+
+Putting these together, a bloom filter with a large enough capacity could
+be set up when it detects the problem, and used to skip the redundant work.
+This would change the checking overhead from `O(N^2)` to `O(N^F)` where F is
+the false positive rate of the bloom filter. And the false positive rate of
+the usual git-annex bloom filter is small: 1/1000000 when half a million
+files are in it. Since 1-10 million files is where git gets too slow to be
+usable, the false positive rate should remain low up until the point other
+performance becomes a problem.
+
+It would make sense to do this not only in populateUnlockedFiles but in
+Annex.Content.moveAnnex and Annex.Content.removeAnnex. Although removeAnnex
+would need a different bloom filter, since a file might have been populated
+and then somehow get removed in the same git-annex call.
+"""]]
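
The skip logic described in the comment above can be sketched roughly as follows. This is a minimal illustration in Python, not git-annex's Haskell implementation: `BloomFilter`, `THRESHOLD`, and `count_file_checks` are hypothetical names, the filter size and hash count are arbitrary, and the sketch only models repeated adds of the same key. It does not model how a false positive would be handled.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Hypothetical cutoff, taken from the "say 10 files" remark in the comment.
THRESHOLD = 10


def count_file_checks(adds, associated_files, use_filter):
    """Count per-file checks performed while adding keys in order.

    adds: keys in the order they are added (a key may repeat)
    associated_files: dict mapping each key to its list of files
    """
    seen = BloomFilter(num_bits=1 << 16, num_hashes=7)
    checks = 0
    for key in adds:
        files = associated_files[key]
        long_list = len(files) >= THRESHOLD
        if use_filter and long_list and key in seen:
            continue  # the filter reports this key as already handled
        checks += len(files)
        if use_filter and long_list:
            seen.add(key)
    return checks


if __name__ == "__main__":
    files = {"big": [f"file{i}" for i in range(1000)], "small": ["a", "b"]}
    adds = ["big"] * 1000 + ["small"] * 3
    print(count_file_checks(adds, files, use_filter=False))
    print(count_file_checks(adds, files, use_filter=True))
```

Keys with short file lists never touch the filter, which mirrors the goal of leaving the usual case without extra work.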