This commit is contained in:
Joey Hess 2023-06-01 14:21:55 -04:00
parent 40017089f2
commit 594110a6af
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38

View file

@ -0,0 +1,28 @@
[[!comment format=mdwn
username="joey"
subject="""comment 5"""
date="2023-06-01T17:47:31Z"
content="""
As far as how reads from the cidsdb scale, I think that sqlite databases
like this do slow down somewhat when tables get massive, but I don't really
know the details. And it's of course possible that something could be
improved in the schema or queries.
I've been working on that todo I linked above, and the speed gain is
impressive when there are few or no changed files in the remote. With 20,000
unchanged files, re-running git-annex import[1] sped up from 125.95 to 3.84
seconds. With 40,000 unchanged files, it sped up from 477 to 8.13 seconds. I
haven't tried with 150000 files yet but the pattern is clear.
> I can rerun the sync with an unchanged import directory. It still takes
> 107 minutes, the majority of which is spent reading cidsdb. Only the
> first minute or two are spent scanning the source area.
Well, I think I've certianly solved that problem. But I don't know if there's
something else that is making the initial sync slower than it needs to
be.
[1] More accurately, re-running it a second time, both to get a warm cache
result, and because the first time, it is busy updating the cidsdb with
the files that were imported earlier, as described in comment #2.
"""]]