diff --git a/CHANGELOG b/CHANGELOG index 452dca605a..1607d765c3 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -78,6 +78,8 @@ git-annex (10.20230408) UNRELEASED; urgency=medium * repair: Fix handling of git ref names on Windows. * Large speed up to importing trees from special remotes that contain a lot of files, by only processing changed files. + * Optimise database to further speed up importing large trees from special + remotes. * Some other speedups to importing trees from special remotes. * Cache negative lookups of global numcopies and mincopies. Speeds up eg git-annex sync --content by up to 50%. diff --git a/Database/ContentIdentifier.hs b/Database/ContentIdentifier.hs index 7eb1bd0d58..9700f4b693 100644 --- a/Database/ContentIdentifier.hs +++ b/Database/ContentIdentifier.hs @@ -63,12 +63,24 @@ data ContentIdentifierHandle = ContentIdentifierHandle H.DbQueue Bool databaseIsEmpty :: ContentIdentifierHandle -> Bool databaseIsEmpty (ContentIdentifierHandle _ b) = b +-- Note on indexes: ContentIndentifiersKeyRemoteCidIndex etc are really +-- uniqueness constraints, which cause sqlite to automatically add indexes. +-- So when adding indexes, have to take care to only add ones that work as +-- uniqueness constraints. (Unfortunately persistent does not support indexes +-- that are not uniqueness constraints; +-- https://github.com/yesodweb/persistent/issues/109) +-- +-- ContentIndentifiersKeyRemoteCidIndex speeds up queries like +-- getContentIdentifiers, but it is not used for +-- getContentIdentifierKeys. ContentIndentifiersCidRemoteKeyIndex was +-- addedto speed that up. share [mkPersist sqlSettings, mkMigrate "migrateContentIdentifier"] [persistLowerCase| ContentIdentifiers remote UUID cid ContentIdentifier key Key ContentIndentifiersKeyRemoteCidIndex key remote cid + ContentIndentifiersCidRemoteKeyIndex cid remote key -- The last git-annex branch tree sha that was used to update -- ContentIdentifiers AnnexBranch @@ -89,9 +101,8 @@ openDb = do if isnew then initDb db $ void $ runMigrationSilent migrateContentIdentifier - -- Migrate from old version of database, which had - -- an incorrect uniqueness constraint on the - -- ContentIdentifiers table. + -- Migrate from old versions of database, which had buggy + -- and suboptimal uniqueness constraints. else liftIO $ runSqlite (T.pack (fromRawFilePath db)) $ void $ runMigrationSilent migrateContentIdentifier h <- liftIO $ H.openDbQueue db "content_identifiers" diff --git a/doc/bugs/importtree_spends_hours_reading_cidsdb/comment_20_d6d3db42bda1fe094140d54a458157b1._comment b/doc/bugs/importtree_spends_hours_reading_cidsdb/comment_20_d6d3db42bda1fe094140d54a458157b1._comment new file mode 100644 index 0000000000..15a68b211b --- /dev/null +++ b/doc/bugs/importtree_spends_hours_reading_cidsdb/comment_20_d6d3db42bda1fe094140d54a458157b1._comment @@ -0,0 +1,67 @@ +[[!comment format=mdwn + username="joey" + subject="""comment 20""" + date="2023-06-09T16:01:06Z" + content=""" +The sqlite schema is: + + CREATE TABLE IF NOT EXISTS "content_identifiers"("id" INTEGER + PRIMARY KEY,"remote" BLOB NOT NULL,"cid" BLOB NOT NULL,"key" BLOB + NOT NULL,CONSTRAINT "content_indentifiers_key_remote_cid_index" + UNIQUE ("key","remote","cid")); + +According to the sqlite docs: + + In most cases, UNIQUE and PRIMARY KEY constraints are implemented by + creating a unique index in the database. (The exceptions are INTEGER + PRIMARY KEY and PRIMARY KEYs on WITHOUT ROWID tables.) + Hence, the following schemas are logically equivalent: + + CREATE TABLE t1(a, b UNIQUE); + + CREATE TABLE t1(a, b PRIMARY KEY); + + CREATE TABLE t1(a, b); + CREATE UNIQUE INDEX t1b ON t1(b); + +But, that index is not used for a cid query: + + sqlite> EXPLAIN QUERY PLAN SELECT key FROM content_identifiers WHERE cid = "foo"; + QUERY PLAN + `--SCAN content_identifiers + +Aha, it does use the UNIQUE index for queries going the other way: + + sqlite> EXPLAIN QUERY PLAN SELECT cid FROM content_identifiers WHERE key = "foo"; + QUERY PLAN + `--SEARCH content_identifiers USING COVERING INDEX sqlite_autoindex_content_identifiers_1 (key=?) + +Even a query using the "remote" column seems to use the UNIQUE index: + + sqlite> EXPLAIN QUERY PLAN SELECT cid FROM content_identifiers WHERE key = "foo" and remote = "bar"; + QUERY PLAN + `--SEARCH content_identifiers USING COVERING INDEX sqlite_autoindex_content_identifiers_1 (key=? AND remote=?) + +Here's a thread on mastodon about this, with some ideas about why sqlite3 behaves this way. +. +Also, see this commit where I added a second index to a database, that was the same columns +as an existing index, only in a different order, and it did speed up a query. +[[!commit ca2a527e93ca22548983a7285fc6e0a892ca44a4]]. +Oh and there was a uniqueness constraint with the cid first, but it was broken and got removed +in [[!commit c6e693b25de49d4d3b2fedb49ffb42f04f5fd544]]. So probably it got slower at that point +(Dec 2020). + +So probably most other tables used by git-annex do have usable indexes, for +most queries. But not this one. Gonna need to manually add an index. + +(Or possibly redo the schema to have `UNIQUE ("key","cid","remote")`, it may be +that sqlite only creates an index for up to 2 columns or something like that.) + + sqlite> CREATE INDEX idx_content_identifiers_cid on content_identifiers (cid); + sqlite> EXPLAIN QUERY PLAN SELECT key FROM content_identifiers WHERE cid = "foo"; + QUERY PLAN + `--SEARCH content_identifiers USING INDEX idx_content_identifiers_cid (cid=?) + +After modifying the database this way, I re-ran the slow sync and it's now able +to run getContentIdentifierKeys 2858 times a second. What an improvement! +"""]]