reproduced
This commit is contained in:
parent
26a9ea12d1
commit
0eff5a3f71
2 changed files with 48 additions and 0 deletions
@@ -0,0 +1,31 @@
[[!comment format=mdwn
username="joey"
subject="""comment 23"""
date="2021-06-14T14:09:07Z"
content="""
The file contents all being the same is the crucial thing. On Linux,
adding 1000 duplicate files at a time (all in the same directory), I get:

	run 1: 0:08
	run 2: 0:42
	run 3: 1:14
	run 4: 1:46

After run 4, adding 1000 files with all different content takes
0:11, so that is not appreciably slowed down; it only affects adding dups,
and only when there are a *lot* of them.

This feels like quite an edge case, and also not
really a new problem, since unlocked files would have already
had the same problem before recent changes.

I thought this might be an inefficiency in sqlite's index, similar to how
hash tables can scale poorly when a lot of things end up in the same
bucket. But disabling the index did not improve performance.

Aha -- the slowdown is caused by `git-annex add` looking to see what other
annexed files use the same content, so that it can populate any unlocked
files that didn't have the content present before. With all these locked
files now recorded in the db, it has to check each file in turn, and
that's where the `O(N^2)` comes from.
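To make that scaling concrete, here is a minimal Python model (purely illustrative; the function name and structure are mine, not git-annex code) of N adds that all share one key, where each add scans every file already recorded in the db under that key:

```python
# Illustrative model of the quadratic behavior described above:
# every added file with key K triggers a check of all files already
# recorded under K, so total work across N adds is 0+1+2+...+(N-1).

def adds_with_same_key_scan(n_files):
    """Count total scan work when all n_files share one key."""
    recorded = []          # files already in the db for this key
    total_checks = 0
    for i in range(n_files):
        total_checks += len(recorded)  # check each prior file in turn
        recorded.append("file%d" % i)
    return total_checks

# Total work is N*(N-1)/2 checks, i.e. O(N^2).
assert adds_with_same_key_scan(1000) == 1000 * 999 // 2
```

In this model each additional 1000-file batch costs a roughly constant extra amount of scanning over the batch before it, which is consistent with the roughly linear growth in per-run times above.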
"""]]

@@ -0,0 +1,17 @@
[[!comment format=mdwn
username="joey"
subject="""comment 24"""
date="2021-06-14T15:36:30Z"
content="""
||||||
|
If the database recorded when files were unlocked or not, that could be
|
||||||
|
avoided, but tracking that would add a lot of complexity for what is just
|
||||||
|
an edge case. And probably slow things down generally by some amount due to
|
||||||
|
the db being larger.
|
||||||
|
|
||||||
|
It seems almost cheating, but it could remember the last few keys it's added,
|
||||||
|
and avoid trying to populate unlocked files when adding those keys again.
|
||||||
|
This would slow down the usual case by some tiny amount (eg an IORef access)
|
||||||
|
but avoid `O(N^2)` in this edge case. Though it wouldn't fix all edge cases,
|
||||||
|
eg when the files it's adding rotate through X different contents, and X is
|
||||||
|
larger than the number of keys it remembers.
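A rough Python sketch of that idea (hypothetical names throughout; git-annex itself would hold this in an IORef rather than a class): a small bounded cache of recently added keys, with the rotation caveat visible at the end.

```python
# Hypothetical sketch of the "remember the last few keys" idea: if a key
# was added very recently, skip the scan for unlocked files to populate.
# Names are illustrative, not git-annex's actual API.
from collections import OrderedDict

class RecentKeys:
    def __init__(self, capacity=10):
        self.capacity = capacity
        self.keys = OrderedDict()   # insertion-ordered, acts as an LRU

    def seen(self, key):
        """Return True if key was added recently; remember it either way."""
        hit = key in self.keys
        if hit:
            self.keys.move_to_end(key)       # refresh recency
        else:
            self.keys[key] = True
            if len(self.keys) > self.capacity:
                self.keys.popitem(last=False)  # evict the oldest key
        return hit

cache = RecentKeys(capacity=2)
assert cache.seen("K1") is False    # first add: do the full scan
assert cache.seen("K1") is True     # dup: skip populating unlocked files
cache.seen("K2"); cache.seen("K3")  # K1 evicted once capacity is exceeded
assert cache.seen("K1") is False    # rotating through >capacity keys defeats it
```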
"""]]