add ContentIndentifiersCidRemoteKeyIndex

Optimise database to further speed up importing large trees from special
remotes.

See comment for details of why the other index didn't help cid queries.

It would probably be better to manually create an index on only cid, rather
than adding a second uniqueness constraint that is a larger index. But
persistent does not support creating indexes, and an attempt to manually add
it to the migration failed.

Sponsored-by: Nicholas Golder-Manning on Patreon
Joey Hess 2023-06-09 15:12:33 -04:00
parent 532b227086
commit a0ab425c95
3 changed files with 83 additions and 3 deletions


@ -0,0 +1,67 @@
[[!comment format=mdwn
username="joey"
subject="""comment 20"""
date="2023-06-09T16:01:06Z"
content="""
The sqlite schema is:

    CREATE TABLE IF NOT EXISTS "content_identifiers"("id" INTEGER
    PRIMARY KEY,"remote" BLOB NOT NULL,"cid" BLOB NOT NULL,"key" BLOB
    NOT NULL,CONSTRAINT "content_indentifiers_key_remote_cid_index"
    UNIQUE ("key","remote","cid"));

According to the sqlite docs:

> In most cases, UNIQUE and PRIMARY KEY constraints are implemented by
> creating a unique index in the database. (The exceptions are INTEGER
> PRIMARY KEY and PRIMARY KEYs on WITHOUT ROWID tables.)
> Hence, the following schemas are logically equivalent:
>
>     CREATE TABLE t1(a, b UNIQUE);
>
>     CREATE TABLE t1(a, b PRIMARY KEY);
>
>     CREATE TABLE t1(a, b);
>     CREATE UNIQUE INDEX t1b ON t1(b);

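The quoted equivalence can be checked from Python's bundled `sqlite3` module. This is only a sketch under assumptions: the table names `t1` and `t2` and the helper `index_names` are made up for illustration, and `PRAGMA index_list` is used to list whatever indexes sqlite created behind each table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema 1: a UNIQUE column constraint, no explicit index.
conn.execute("CREATE TABLE t1(a, b UNIQUE)")

# Schema 2: no constraint, but an explicit unique index.
conn.execute("CREATE TABLE t2(a, b)")
conn.execute("CREATE UNIQUE INDEX t2b ON t2(b)")

def index_names(table):
    # PRAGMA index_list rows start with (seq, name, unique, ...).
    # Keep only the index name and its unique flag.
    rows = conn.execute(f"PRAGMA index_list({table})").fetchall()
    return [(row[1], row[2]) for row in rows]

t1_indexes = index_names("t1")
t2_indexes = index_names("t2")
print("t1:", t1_indexes)
print("t2:", t2_indexes)
```

Both tables end up with exactly one unique index on `b`. The only difference is the name: the one implied by the constraint gets an automatic `sqlite_autoindex_...` name, like the one in the query plans for `content_identifiers`.
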
But, that index is not used for a cid query:

    sqlite> EXPLAIN QUERY PLAN SELECT key FROM content_identifiers WHERE cid = "foo";
    QUERY PLAN
    `--SCAN content_identifiers

Aha, it does use the UNIQUE index for queries going the other way:

    sqlite> EXPLAIN QUERY PLAN SELECT cid FROM content_identifiers WHERE key = "foo";
    QUERY PLAN
    `--SEARCH content_identifiers USING COVERING INDEX sqlite_autoindex_content_identifiers_1 (key=?)

Even a query using the "remote" column seems to use the UNIQUE index:

    sqlite> EXPLAIN QUERY PLAN SELECT cid FROM content_identifiers WHERE key = "foo" and remote = "bar";
    QUERY PLAN
    `--SEARCH content_identifiers USING COVERING INDEX sqlite_autoindex_content_identifiers_1 (key=? AND remote=?)

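The three plans above can be reproduced against an empty in-memory database. A likely explanation, offered here as an assumption rather than something the transcripts prove, is that the index is sorted by `key` first, so sqlite can only seek into it when the query constrains `key`, the leftmost column. The sketch below uses the same schema. The helper `plan` is hypothetical, and the extra `remote`-only query is not from the original transcripts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE content_identifiers('
    '"id" INTEGER PRIMARY KEY, "remote" BLOB NOT NULL, '
    '"cid" BLOB NOT NULL, "key" BLOB NOT NULL, '
    'CONSTRAINT "content_indentifiers_key_remote_cid_index" '
    'UNIQUE ("key","remote","cid"))'
)

def plan(where):
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    sql = "EXPLAIN QUERY PLAN SELECT key FROM content_identifiers WHERE " + where
    return " ".join(row[3] for row in conn.execute(sql))

plans = {
    where: plan(where)
    for where in (
        "key = 'foo'",                     # leftmost column constrained
        "key = 'foo' AND remote = 'bar'",  # leftmost two columns constrained
        "cid = 'foo'",                     # leftmost column not constrained
        "remote = 'bar'",                  # leftmost column not constrained
    )
}
for where, detail in plans.items():
    print(where, "->", detail)
```

The queries that constrain `key` should report a SEARCH using the automatic index, and the ones that do not should fall back to a SCAN. The exact wording of the plan text varies between sqlite versions.
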
Here's a thread on Mastodon about this, with some ideas about why sqlite3 behaves this way:
<https://octodon.social/@joeyh/110515161416516029>

Also, see this commit, where I added a second index to a database. It covered the same columns
as an existing index, only in a different order, and it did speed up a query:
[[!commit ca2a527e93ca22548983a7285fc6e0a892ca44a4]]

Oh, and there used to be a uniqueness constraint with the cid first, but it was broken and got removed
in [[!commit c6e693b25de49d4d3b2fedb49ffb42f04f5fd544]]. So cid queries probably got slower at that point
(Dec 2020).

So probably most other tables used by git-annex do have usable indexes, for
most queries. But not this one. Gonna need to manually add an index.
(Or possibly redo the schema to have `UNIQUE ("key","cid","remote")`, it may be
that sqlite only creates an index for up to 2 columns or something like that.)

    sqlite> CREATE INDEX idx_content_identifiers_cid on content_identifiers (cid);
    sqlite> EXPLAIN QUERY PLAN SELECT key FROM content_identifiers WHERE cid = "foo";
    QUERY PLAN
    `--SEARCH content_identifiers USING INDEX idx_content_identifiers_cid (cid=?)

After modifying the database this way, I re-ran the slow sync and it's now able
to run getContentIdentifierKeys 2858 times a second. What an improvement!
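
The options discussed here can be compared side by side on an empty in-memory database. This is a rough sketch, not what git-annex does: the helper `cid_plan` and the index names `cid_remote_key` and `key_cid_remote` are invented for illustration. It also uses `CREATE UNIQUE INDEX` in place of a second table constraint, which the sqlite docs quoted earlier describe as logically equivalent.

```python
import sqlite3

def cid_plan(extra_sql):
    # Build the original schema, optionally add one more index,
    # and return the query plan detail for a lookup by cid.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE content_identifiers('
        '"id" INTEGER PRIMARY KEY, "remote" BLOB NOT NULL, '
        '"cid" BLOB NOT NULL, "key" BLOB NOT NULL, '
        'UNIQUE ("key","remote","cid"))'
    )
    if extra_sql:
        conn.execute(extra_sql)
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT key FROM content_identifiers WHERE cid = 'foo'"
    ).fetchall()
    conn.close()
    return " ".join(row[3] for row in rows)

before = cid_plan(None)
with_cid_index = cid_plan(
    "CREATE INDEX idx_content_identifiers_cid ON content_identifiers (cid)")
with_cid_first = cid_plan(
    "CREATE UNIQUE INDEX cid_remote_key ON content_identifiers (cid, remote, key)")
with_key_first = cid_plan(
    "CREATE UNIQUE INDEX key_cid_remote ON content_identifiers (key, cid, remote)")

for label, detail in (
    ("original schema", before),
    ("index on cid only", with_cid_index),
    ("unique (cid, remote, key)", with_cid_first),
    ("unique (key, cid, remote)", with_key_first),
):
    print(label, "->", detail)
```

On the sqlite versions I would expect to meet, both the cid-only index and the cid-first unique index turn the SCAN into a SEARCH. Reordering to `(key, cid, remote)` keeps `key` as the leading column, so the cid lookup should still be a SCAN. That points at column order, not the number of columns, as what matters.
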
"""]]