add ContentIndentifiersCidRemoteKeyIndex

Optimise database to further speed up importing large trees from special
remotes.

See comment for details of why the other index didn't help cid queries.

It would probably be better to manually create an index on only cid, rather
than adding a second uniqueness constraint that is a larger index. But
persitent does not support creating indexes, and an attempt to manually add
it to the migration failed.

Sponsored-by: Nicholas Golder-Manning on Patreon
This commit is contained in:
Joey Hess 2023-06-09 15:12:33 -04:00
parent 532b227086
commit a0ab425c95
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
3 changed files with 83 additions and 3 deletions

View file

@ -78,6 +78,8 @@ git-annex (10.20230408) UNRELEASED; urgency=medium
* repair: Fix handling of git ref names on Windows.
* Large speed up to importing trees from special remotes that contain a lot
of files, by only processing changed files.
* Optimise database to further speed up importing large trees from special
remotes.
* Some other speedups to importing trees from special remotes.
* Cache negative lookups of global numcopies and mincopies.
Speeds up eg git-annex sync --content by up to 50%.

View file

@ -63,12 +63,24 @@ data ContentIdentifierHandle = ContentIdentifierHandle H.DbQueue Bool
databaseIsEmpty :: ContentIdentifierHandle -> Bool
databaseIsEmpty (ContentIdentifierHandle _ b) = b
-- Note on indexes: ContentIndentifiersKeyRemoteCidIndex etc are really
-- uniqueness constraints, which cause sqlite to automatically add indexes.
-- So when adding indexes, have to take care to only add ones that work as
-- uniqueness constraints. (Unfortunately persistent does not support indexes
-- that are not uniqueness constraints;
-- https://github.com/yesodweb/persistent/issues/109)
--
-- ContentIndentifiersKeyRemoteCidIndex speeds up queries like
-- getContentIdentifiers, but it is not used for
-- getContentIdentifierKeys. ContentIndentifiersCidRemoteKeyIndex was
-- addedto speed that up.
share [mkPersist sqlSettings, mkMigrate "migrateContentIdentifier"] [persistLowerCase|
ContentIdentifiers
remote UUID
cid ContentIdentifier
key Key
ContentIndentifiersKeyRemoteCidIndex key remote cid
ContentIndentifiersCidRemoteKeyIndex cid remote key
-- The last git-annex branch tree sha that was used to update
-- ContentIdentifiers
AnnexBranch
@ -89,9 +101,8 @@ openDb = do
if isnew
then initDb db $ void $
runMigrationSilent migrateContentIdentifier
-- Migrate from old version of database, which had
-- an incorrect uniqueness constraint on the
-- ContentIdentifiers table.
-- Migrate from old versions of database, which had buggy
-- and suboptimal uniqueness constraints.
else liftIO $ runSqlite (T.pack (fromRawFilePath db)) $ void $
runMigrationSilent migrateContentIdentifier
h <- liftIO $ H.openDbQueue db "content_identifiers"

View file

@ -0,0 +1,67 @@
[[!comment format=mdwn
username="joey"
subject="""comment 20"""
date="2023-06-09T16:01:06Z"
content="""
The sqlite schema is:
CREATE TABLE IF NOT EXISTS "content_identifiers"("id" INTEGER
PRIMARY KEY,"remote" BLOB NOT NULL,"cid" BLOB NOT NULL,"key" BLOB
NOT NULL,CONSTRAINT "content_indentifiers_key_remote_cid_index"
UNIQUE ("key","remote","cid"));
According to the sqlite docs:
In most cases, UNIQUE and PRIMARY KEY constraints are implemented by
creating a unique index in the database. (The exceptions are INTEGER
PRIMARY KEY and PRIMARY KEYs on WITHOUT ROWID tables.)
Hence, the following schemas are logically equivalent:
CREATE TABLE t1(a, b UNIQUE);
CREATE TABLE t1(a, b PRIMARY KEY);
CREATE TABLE t1(a, b);
CREATE UNIQUE INDEX t1b ON t1(b);
But, that index is not used for a cid query:
sqlite> EXPLAIN QUERY PLAN SELECT key FROM content_identifiers WHERE cid = "foo";
QUERY PLAN
`--SCAN content_identifiers
Aha, it does use the UNIQUE index for queries going the other way:
sqlite> EXPLAIN QUERY PLAN SELECT cid FROM content_identifiers WHERE key = "foo";
QUERY PLAN
`--SEARCH content_identifiers USING COVERING INDEX sqlite_autoindex_content_identifiers_1 (key=?)
Even a query using the "remote" column seems to use the UNIQUE index:
sqlite> EXPLAIN QUERY PLAN SELECT cid FROM content_identifiers WHERE key = "foo" and remote = "bar";
QUERY PLAN
`--SEARCH content_identifiers USING COVERING INDEX sqlite_autoindex_content_identifiers_1 (key=? AND remote=?)
Here's a thread on mastodon about this, with some ideas about why sqlite3 behaves this way.
<https://octodon.social/@joeyh/110515161416516029>.
Also, see this commit where I added a second index to a database, that was the same columns
as an existing index, only in a different order, and it did speed up a query.
[[!commit ca2a527e93ca22548983a7285fc6e0a892ca44a4]].
Oh and there was a uniqueness constraint with the cid first, but it was broken and got removed
in [[!commit c6e693b25de49d4d3b2fedb49ffb42f04f5fd544]]. So probably it got slower at that point
(Dec 2020).
So probably most other tables used by git-annex do have usable indexes, for
most queries. But not this one. Gonna need to manually add an index.
(Or possibly redo the schema to have `UNIQUE ("key","cid","remote")`, it may be
that sqlite only creates an index for up to 2 columns or something like that.)
sqlite> CREATE INDEX idx_content_identifiers_cid on content_identifiers (cid);
sqlite> EXPLAIN QUERY PLAN SELECT key FROM content_identifiers WHERE cid = "foo";
QUERY PLAN
`--SEARCH content_identifiers USING INDEX idx_content_identifiers_cid (cid=?)
After modifying the database this way, I re-ran the slow sync and it's now able
to run getContentIdentifierKeys 2858 times a second. What an improvement!
"""]]