comment

2023-06-01 14:21:55 -04:00 · 2023-06-01 14:21:55 -04:00 · 594110a6af
commit 594110a6af
parent 40017089f2
1 changed files with 28 additions and 0 deletions
--- a/doc/bugs/importtree_spends_hours_reading_cidsdb/comment_5_512996afefabc75a0b2258fc05ccbfdd._comment
+++ b/doc/bugs/importtree_spends_hours_reading_cidsdb/comment_5_512996afefabc75a0b2258fc05ccbfdd._comment
@ -0,0 +1,28 @@
 [[!comment format=mdwn
 username="joey"
 subject="""comment 5"""
 date="2023-06-01T17:47:31Z"
 content="""
 As far as how reads from the cidsdb scale, I think that sqlite databases
 like this do slow down somewhat when tables get massive, but I don't really
 know the details. And it's of course possible that something could be
 improved in the schema or queries.
 I've been working on that todo I linked above, and the speed gain is
 impressive when there are few or no changed files in the remote. With 20,000
 unchanged files, re-running git-annex import[1] sped up from 125.95 to 3.84
 seconds. With 40,000 unchanged files, it sped up from 477 to 8.13 seconds. I
 haven't tried with 150000 files yet but the pattern is clear.
 > I can rerun the sync with an unchanged import directory.  It still takes
 > 107 minutes, the majority of which is spent reading cidsdb.  Only the
 > first minute or two are spent scanning the source area.
 Well, I think I've certianly solved that problem. But I don't know if there's
 something else that is making the initial sync slower than it needs to
 be.
 [1] More accurately, re-running it a second time, both to get a warm cache
 result, and because the first time, it is busy updating the cidsdb with
 the files that were imported earlier, as described in comment #2.
 """]]