reproduced

Joey Hess 2021-06-14 11:37:21 -04:00
parent 26a9ea12d1
commit 0eff5a3f71
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
2 changed files with 48 additions and 0 deletions

@ -0,0 +1,31 @@
[[!comment format=mdwn
username="joey"
subject="""comment 23"""
date="2021-06-14T14:09:07Z"
content="""
The file contents all being the same is the crucial thing. On Linux,
adding 1000 dup files at a time (all in the same directory), I get:

    run 1: 0:08
    run 2: 0:42
    run 3: 1:14
    run 4: 1:46

After run 4, adding 1000 files with all different content takes
0:11, so not appreciably slowed down; it only affects adding dups,
and only when there are a *lot* of them.

This feels like quite an edge case, and also not
really a new problem, since unlocked files would have already
had the same problem before recent changes.

I thought this might be an inefficiency in sqlite's index, similar to how
hash tables can scale poorly when a lot of things end up in the same
bucket. But disabling the index did not improve performance.

Aha -- the slowdown is caused by `git-annex add` looking to see what other
annexed files use the same content, so that it can populate any unlocked
files that didn't have the content present before. With all these locked
files now recorded in the db, it has to check each file in turn, and
there's the `O(N^2)`.
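
To make the quadratic cost concrete, here is a toy model in Python. It is only a sketch and not git-annex's actual (Haskell) code: `add_files`, the `db` mapping, and the key name `SAMEKEY` are all hypothetical stand-ins, and the model assumes each add looks at every file already recorded for the same key.

```python
from collections import defaultdict


def add_files(db, files):
    """Add (filename, key) pairs to db; return the number of per-file checks.

    Hypothetical model: before recording a file, look at every file already
    recorded under the same key, as the populate step is described as doing.
    """
    checks = 0
    for name, key in files:
        checks += len(db[key])   # one check per file already using this key
        db[key].append(name)     # then record the new file under the key
    return checks


db = defaultdict(list)           # key -> files recorded as using that key
per_run = []
for run in range(4):
    # 1000 files per run, all with identical content (same key).
    batch = [(f"run{run}/file{i}", "SAMEKEY") for i in range(1000)]
    per_run.append(add_files(db, batch))

print(per_run)
```

In this model each run does a fixed number of checks more than the previous one, so the total grows quadratically. That is consistent with the roughly constant step between the run timings above. Files with distinct keys trigger no checks at all in the model, which matches the unaffected different-content case.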
"""]]

@ -0,0 +1,17 @@
[[!comment format=mdwn
username="joey"
subject="""comment 24"""
date="2021-06-14T15:36:30Z"
content="""
If the database recorded whether files were unlocked or not, that could be
avoided, but tracking that would add a lot of complexity for what is just
an edge case. And probably slow things down generally by some amount due to
the db being larger.

It seems almost like cheating, but it could remember the last few keys it's added,
and avoid trying to populate unlocked files when adding those keys again.
This would slow down the usual case by some tiny amount (e.g. an IORef access)
but avoid `O(N^2)` in this edge case. Though it wouldn't fix all edge cases,
e.g. when the files it's adding rotate through X different contents, and X is
larger than the number of keys it remembers.
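
A rough Python sketch of that idea, under stated assumptions: `RecentKeys` and `count_populate_scans` are hypothetical names, the least-recently-used eviction policy is my choice rather than anything specified above, and the in-memory `OrderedDict` merely stands in for state that would live behind something like an IORef.

```python
from collections import OrderedDict


class RecentKeys:
    """Remembers the last `capacity` keys added (a small LRU set)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._keys = OrderedDict()

    def seen_recently(self, key):
        """Return True if key was added recently; record it either way."""
        hit = key in self._keys
        if hit:
            self._keys.move_to_end(key)         # refresh its position
        else:
            self._keys[key] = None
            if len(self._keys) > self.capacity:
                self._keys.popitem(last=False)  # evict the oldest key
        return hit


def count_populate_scans(keys, capacity):
    """Count the adds that would still do the expensive populate scan."""
    recent = RecentKeys(capacity)
    return sum(1 for key in keys if not recent.seen_recently(key))


all_dups = ["K"] * 1000                        # one content, added 1000 times
rotating = [f"K{i % 5}" for i in range(1000)]  # rotates through X = 5 contents

print(count_populate_scans(all_dups, capacity=4))
print(count_populate_scans(rotating, capacity=4))
print(count_populate_scans(rotating, capacity=5))
```

With all-identical content only the first add scans. Rotating through 5 contents while remembering 4 keys misses on every add, which is the remaining edge case described above; remembering 5 keys brings it back down to one scan per distinct content.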
"""]]