comment

2023-06-02 12:13:50 -04:00 · 2023-06-02 12:13:50 -04:00 · b40b368857
commit b40b368857
parent f6dd34ca81
1 changed files with 33 additions and 0 deletions
--- a/doc/bugs/importtree_spends_hours_reading_cidsdb/comment_8_1389db945973ed42b8ddd0de3cc8889c._comment
+++ b/doc/bugs/importtree_spends_hours_reading_cidsdb/comment_8_1389db945973ed42b8ddd0de3cc8889c._comment
@@ -0,0 +1,33 @@
 [[!comment format=mdwn
 username="joey"
 subject="""comment 8"""
 date="2023-06-02T15:25:12Z"
 content="""
@jgoerzen if you want to take a look at the SQL, see
 `Database/ContentIdentifier.hs`. `getContentIdentifierKeys` is the query
 that it runs for each file. I'm not sure right now whether the
 persistent schema in there actually creates an index that is used for that
 query. persistent's documentation of indexes is lacking, and I may have
 misunderstood that uniqueness constraints result in indexes being created.
 Dumping the database shows this, which has only the uniqueness constraint
 and no separate index on `cid`:
 	CREATE TABLE IF NOT EXISTS "content_identifiers"("id" INTEGER
 	PRIMARY KEY,"remote" BLOB NOT NULL,"cid" BLOB NOT NULL,"key" BLOB
 	NOT NULL,CONSTRAINT "content_indentifiers_key_remote_cid_index"
 	UNIQUE ("key","remote","cid"));
 It may need some raw SQL to add one, like:
 	CREATE INDEX cidindex ON "content_identifiers" ("cid");
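
One way to check which index, if any, SQLite picks for this lookup is `EXPLAIN QUERY PLAN`. The sketch below uses Python's `sqlite3` module with the schema from the dump. The `SELECT` is only an assumed approximation of what `getContentIdentifierKeys` runs, not the SQL that persistent actually generates. Note that SQLite does back a UNIQUE constraint with an implicit index, but here its leading column is `key`, so it cannot be used to seek by `cid`.

```python
import sqlite3

# Schema copied from the dump above (constraint name spelled as in the dump).
SCHEMA = '''
CREATE TABLE IF NOT EXISTS "content_identifiers"("id" INTEGER
PRIMARY KEY,"remote" BLOB NOT NULL,"cid" BLOB NOT NULL,"key" BLOB
NOT NULL,CONSTRAINT "content_indentifiers_key_remote_cid_index"
UNIQUE ("key","remote","cid"));
'''

# Assumption: the lookup filters on cid and remote and returns the key.
LOOKUP = 'SELECT "key" FROM "content_identifiers" WHERE "cid" = ? AND "remote" = ?'


def query_plan(conn):
    # The human-readable detail is the last column of each plan row.
    rows = conn.execute("EXPLAIN QUERY PLAN " + LOOKUP, (b"cid", b"remote"))
    return [row[-1] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

plan_before = query_plan(conn)  # only the implicit UNIQUE index exists
conn.execute('CREATE INDEX cidindex ON "content_identifiers" ("cid")')
plan_after = query_plan(conn)   # cidindex is now available to the planner

print("before:", plan_before)
print("after: ", plan_after)
```

Without `cidindex` the plan should be a full scan (of the table or of the implicit index); with it, the plan should become a search using `cidindex`. The exact wording of the plan text varies between SQLite versions.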
 Also, I re-ran the 150000-file sync benchmark with `getContentIdentifierKeys`
 disabled, and it took 29:56.78, so 25% faster.
 That gives me an idea for an optimisation: it could check whether the
 database is empty at start and, if so, avoid calling that at all. (It also
 maintains a map in memory, which will still allow it to detect duplicate files.)
 Speeding up initial imports of a lot of files, but not later imports of a lot
 of files, is kind of a cop-out, but..
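
The empty-database shortcut could look roughly like the sketch below. It is written in Python against the dumped schema purely for illustration. `CidLookup`, `keys_for`, `record` and the `db_queries` counter are hypothetical names, not git-annex code, and the `SELECT` is an assumed stand-in for `getContentIdentifierKeys`.

```python
import sqlite3

# Schema copied from the dump above.
SCHEMA = '''
CREATE TABLE IF NOT EXISTS "content_identifiers"("id" INTEGER
PRIMARY KEY,"remote" BLOB NOT NULL,"cid" BLOB NOT NULL,"key" BLOB
NOT NULL,CONSTRAINT "content_indentifiers_key_remote_cid_index"
UNIQUE ("key","remote","cid"));
'''


class CidLookup:
    # Hypothetical helper: skips database lookups when the table was
    # empty at start, relying on an in-memory map for this run's entries.
    def __init__(self, conn):
        self.conn = conn
        # Checked once at start: an empty table cannot contain any match.
        first = conn.execute('SELECT 1 FROM "content_identifiers" LIMIT 1').fetchone()
        self.db_was_empty = first is None
        self.seen = {}       # (remote, cid) -> keys recorded during this run
        self.db_queries = 0  # lookups that actually reached the database

    def keys_for(self, remote, cid):
        found = list(self.seen.get((remote, cid), []))
        if not self.db_was_empty:
            self.db_queries += 1
            rows = self.conn.execute(
                'SELECT "key" FROM "content_identifiers" WHERE "cid" = ? AND "remote" = ?',
                (cid, remote))
            found += [r[0] for r in rows if r[0] not in found]
        return found

    def record(self, remote, cid, key):
        keys = self.seen.setdefault((remote, cid), [])
        if key not in keys:
            keys.append(key)
        self.conn.execute(
            'INSERT OR IGNORE INTO "content_identifiers" ("remote", "cid", "key") VALUES (?, ?, ?)',
            (remote, cid, key))


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
lookup = CidLookup(conn)
lookup.record(b"remote-uuid", b"cid-1", b"key-1")
print(lookup.keys_for(b"remote-uuid", b"cid-1"), lookup.db_queries)
```

In this sketch a duplicate seen during an initial import is still found through the in-memory map, while a later run against the now-populated database falls back to one query per lookup, which is the case an index on `cid` would be needed for.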
 """]]