Joey Hess 2023-06-02 12:13:50 -04:00
parent f6dd34ca81
commit b40b368857

@ -0,0 +1,33 @@
[[!comment format=mdwn
username="joey"
subject="""comment 8"""
date="2023-06-02T15:25:12Z"
content="""
@jgoerzen if you want to take a look at the SQL, see
`Database/ContentIdentifier.hs`. `getContentIdentifierKeys` is the query
that it's running on each file. I'm not really sure right now if the
persistent schema in there actually creates an index that is used for that
query. persistent's documentation of indexes is lacking, and I may have been
wrong to assume that uniqueness constraints result in indexes being created.
Dumping the database shows this, which really doesn't seem to have an index
after all:

    CREATE TABLE IF NOT EXISTS "content_identifiers"("id" INTEGER
    PRIMARY KEY,"remote" BLOB NOT NULL,"cid" BLOB NOT NULL,"key" BLOB
    NOT NULL,CONSTRAINT "content_indentifiers_key_remote_cid_index"
    UNIQUE ("key","remote","cid"));

May need some raw SQL to add it, like:

    CREATE INDEX cidindex ON "content_identifiers" ("cid");

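
One way to check whether such an index would actually get used is to ask SQLite for its query plan before and after adding it. This is only a minimal sketch using Python's `sqlite3` module on an in-memory copy of the dumped schema. The `SELECT` is an assumption about the shape of the lookup (by `cid` and `remote`), not the SQL that persistent really generates for `getContentIdentifierKeys`.

```python
import sqlite3

# Schema copied from the database dump in this comment.
SCHEMA = (
    'CREATE TABLE IF NOT EXISTS "content_identifiers"("id" INTEGER '
    'PRIMARY KEY,"remote" BLOB NOT NULL,"cid" BLOB NOT NULL,"key" BLOB '
    'NOT NULL,CONSTRAINT "content_indentifiers_key_remote_cid_index" '
    'UNIQUE ("key","remote","cid"));'
)

# Assumed shape of the lookup; the SQL persistent generates may differ.
QUERY = (
    'SELECT "key" FROM "content_identifiers" '
    'WHERE "cid" = ? AND "remote" = ?'
)


def query_plan(conn):
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (b"cid", b"remote"))
    return [row[-1] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

plan_before = query_plan(conn)
conn.execute('CREATE INDEX cidindex ON "content_identifiers" ("cid");')
plan_after = query_plan(conn)

print("before:", plan_before)
print("after: ", plan_after)
```

If the first plan is a SCAN and the second is a SEARCH using `cidindex`, the extra index is being picked up. SQLite does build an automatic index for a UNIQUE constraint, but here its leading column is `key`, so it would not be expected to help a lookup by `cid`.
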
Also, I re-ran the 150000 file sync benchmark with `getContentIdentifierKeys`
disabled and it took 29:56.78, so 25% faster than with the query enabled.

That gives me the idea for an optimisation -- it could check if the
database is empty at start and if so, avoid calling that at all. (It also
maintains a map in memory which will still allow it to detect duplicate files.)
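
A rough sketch of that idea, in Python rather than the real Haskell, just to show the control flow. Everything here is hypothetical: `DuplicateDetector`, `db_lookup` (standing in for `getContentIdentifierKeys`) and `db_was_empty` are made-up names, and the real in-memory map may be shaped differently.

```python
class DuplicateDetector:
    # Hypothetical helper, not git-annex code. db_lookup stands in for
    # getContentIdentifierKeys and maps a content identifier to a list of keys.
    def __init__(self, db_lookup, db_was_empty):
        self.db_lookup = db_lookup
        self.db_was_empty = db_was_empty
        self.seen = {}       # in-memory map: cid -> key, filled during this run
        self.db_queries = 0  # counts how often the database was consulted

    def known_keys(self, cid):
        # Files already seen in this run are caught by the in-memory map.
        if cid in self.seen:
            return [self.seen[cid]]
        # A database that was empty at start cannot know any cid, so skip it.
        if self.db_was_empty:
            return []
        self.db_queries += 1
        return self.db_lookup(cid)

    def record(self, cid, key):
        self.seen[cid] = key


def never_called(cid):
    raise AssertionError("database queried despite being empty at start")


detector = DuplicateDetector(never_called, db_was_empty=True)
first = detector.known_keys(b"cid-1")    # nothing known yet
detector.record(b"cid-1", "key-1")
second = detector.known_keys(b"cid-1")   # duplicate found via the map
print(first, second, detector.db_queries)
```

With `db_was_empty` set, the per-file query is never issued, and duplicates within the same import are still found through the map. A later import into a populated database would take the slow path as before.
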

Speeding up initial imports of a lot of files, but not later imports of a lot
of files, is kind of a cop-out, but..
"""]]