comment

2023-06-01 14:21:55 -04:00 · 2023-06-01 14:21:55 -04:00 · 594110a6af
commit 594110a6af
parent 40017089f2
1 changed files with 28 additions and 0 deletions
--- a/doc/bugs/importtree_spends_hours_reading_cidsdb/comment_5_512996afefabc75a0b2258fc05ccbfdd._comment
+++ b/doc/bugs/importtree_spends_hours_reading_cidsdb/comment_5_512996afefabc75a0b2258fc05ccbfdd._comment
@@ -0,0 +1,28 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 5"""
+ date="2023-06-01T17:47:31Z"
+ content="""
+As far as how reads from the cidsdb scale, I think that sqlite databases
+like this do slow down somewhat when tables get massive, but I don't really
+know the details. And it's of course possible that something could be
+improved in the schema or queries.
+
+I've been working on that todo I linked above, and the speed gain is
+impressive when there are few or no changed files in the remote. With 20,000
+unchanged files, re-running git-annex import[1] sped up from 125.95 to 3.84
+seconds. With 40,000 unchanged files, it sped up from 477 to 8.13 seconds. I
+haven't tried with 150,000 files yet, but the pattern is clear.
+
+> I can rerun the sync with an unchanged import directory.  It still takes
+> 107 minutes, the majority of which is spent reading cidsdb.  Only the
+> first minute or two are spent scanning the source area.
+
+Well, I think I've certainly solved that problem. But I don't know if there's
+something else that is making the initial sync slower than it needs to
+be.
+
+[1] More accurately, re-running it a second time, both to get a warm cache
+result, and because the first time, it is busy updating the cidsdb with
+the files that were imported earlier, as described in comment #2.
+"""]]