* [[metadata]] for views
* [[todo/cache_key_info]]

What do all these have in common? They could all be improved by
using some kind of database to locally store the information in an
efficient way.

The database should only function as a cache. It should be able to be
generated and updated by looking at the git repository.

* Metadata can be updated by looking at the git-annex branch,
  either its current state, or the diff between the old and new versions.
* Incremental fsck information is not stored in git, but can be
  "regenerated" by running fsck again.
  (Perhaps doesn't quite fit, but let it slide..)

Store in the database the Ref of the branch that was used to construct it.
(Update it in the same transaction as the cached data.)
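For instance, a minimal persistent sketch of that idea; the `AnnexBranch`
table and `updateCache` are hypothetical names, and the actual cached-data
updates are elided:

<pre>
{-# LANGUAGE GADTs, GeneralizedNewtypeDeriving, MultiParamTypeClasses,
             OverloadedStrings, QuasiQuotes, TemplateHaskell, TypeFamilies #-}
import Database.Persist
import Database.Persist.Sqlite
import Database.Persist.TH

-- Hypothetical single-row table recording which git-annex branch Ref
-- the cached data was generated from.
share [mkPersist sqlSettings, mkMigrate "migrateAll"] [persistUpperCase|
AnnexBranch
  ref String
|]

-- runSqlite runs its whole action in one transaction, so the Ref and
-- the cached data built from that branch are updated atomically.
updateCache :: String -> IO ()
updateCache ref = runSqlite "cache.db" $ do
    runMigration migrateAll
    -- ... refresh the cached data here, in the same transaction ...
    deleteWhere ([] :: [Filter AnnexBranch])
    insert_ (AnnexBranch ref)
</pre>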
## implementation plan

1. Store incremental fsck info in db, on a branch, with sqlite. **done**
2. Make sure that builds on all platforms, and works reliably. **done**
3. Use sqlite db for associated files cache. **done** (only for v6 unlocked
   files)
4. Use associated files db when dropping files, to fix
   indeterminate preferred content state for duplicated files. **done**
5. Also, use associated files db to construct views.
6. Use sqlite db for metadata cache.
7. Use sqlite db for list of keys present in local annex.
## sqlite or not?

sqlite seems like the most likely thing to work. But it does involve, ugh,
SQL. And even if that's hidden by a layer like persistent, it's still going
to involve some technical debt (eg, database migrations).

It would be great if there were some haskell thing like acid-state
that I could use instead. But, acid-state needs to load the whole
DB into memory. In the comments of
[[bugs/incremental_fsck_should_not_use_sticky_bit]] I examined several
other haskell database-like things, and found them all wanting, except for
possibly TCache. (And TCache is backed by persistent/sqlite anyway.)
## one db or multiple?

Using a single database will use less space. Eg, each Key will only need to
appear in it once, with proper normalization.

OTOH, it's more complicated, and harder to recover from problems.

Currently leaning toward one database per purpose.
## case study: persistent with sqlite

Here's a non-normalized database schema in persistent's syntax.
<pre>
CachedKey
  key Key
  associatedFiles [FilePath]
  lastFscked Int Maybe
  KeyIndex key

CachedMetaData
  key Key
  metaDataField MetaDataField
  metaDataValue MetaDataValue
</pre>
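For concreteness, here's roughly how such a schema is wired up with
persistent's Template Haskell, and how the records below might have been
inserted (a sketch; `Key` is simplified to a String stand-in, where the
real Key type would need a `PersistField` instance):

<pre>
{-# LANGUAGE GADTs, GeneralizedNewtypeDeriving, MultiParamTypeClasses,
             OverloadedStrings, QuasiQuotes, TemplateHaskell, TypeFamilies #-}
import Control.Monad (forM_)
import Database.Persist
import Database.Persist.Sqlite
import Database.Persist.TH

type Key = String  -- stand-in for git-annex's Key type

share [mkPersist sqlSettings, mkMigrate "migrateCache"] [persistUpperCase|
CachedKey
  key Key
  associatedFiles [FilePath]
  lastFscked Int Maybe
  KeyIndex key
|]

main :: IO ()
main = runSqlite "cache.db" $ do
    runMigration migrateCache
    -- Populate the cache, as in the 30k record benchmark.
    forM_ [1..30000 :: Int] $ \n ->
        insert_ $ CachedKey ("key" ++ show n) [] Nothing
</pre>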
Using the above database schema and persistent with sqlite, I made
a database containing 30k Cache records. This took 5 seconds to create
and was 7 mb on disk. (Would be rather smaller, if a more packed Key
show/read instance were used.)

Running 1000 separate queries to get 1000 CachedKeys took 0.688s with warm
cache. This was more than halved when all 1000 queries were done inside the
same `runSqlite` call. (Which could be done using a separate thread and some
MVars.)
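The difference looks something like this (a sketch reusing the entities
and imports from the schema sketch above):

<pre>
import Control.Monad (forM)

-- Each lookup opens the database and runs in its own transaction.
querySeparately :: [Key] -> IO [Maybe (Entity CachedKey)]
querySeparately ks = forM ks $ \k ->
    runSqlite "cache.db" $ getBy (KeyIndex k)

-- One connection and one transaction shared by all the lookups;
-- in the benchmark this more than halved the total query time.
queryBatched :: [Key] -> IO [Maybe (Entity CachedKey)]
queryBatched ks = runSqlite "cache.db" $
    forM ks $ \k -> getBy (KeyIndex k)
</pre>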
(Note that if the database is a cache, there is no need to perform migrations
when querying it. My benchmarks skip `runMigration`. Instead, if the query
fails, the database either doesn't exist or uses an incompatible schema, and
the cache can be rebuilt then. This avoids the problem that persistent's
migrations can sometimes fail.)
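A sketch of that recovery strategy, with `rebuildCache` as a hypothetical
stand-in for regenerating the cache from the git repository:

<pre>
{-# LANGUAGE ScopedTypeVariables #-}
import Control.Exception (SomeException, try)

-- Any failure to query the cache (missing file, incompatible schema
-- left by another git-annex version, corruption) triggers a rebuild.
queryCache :: IO a -> IO a
queryCache query = do
    r <- try query
    case r of
        Right v -> return v
        Left (_ :: SomeException) -> do
            rebuildCache
            query

rebuildCache :: IO ()
rebuildCache = error "hypothetical: regenerate from the git-annex branch"
</pre>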
Doubling the db to 60k scaled linearly in disk and cpu and did not affect
query time.
----

Here's a normalized schema:
<pre>
CachedKey
  key Key
  KeyIndex key
  deriving Show

AssociatedFiles
  keyId CachedKeyId Eq
  associatedFile FilePath
  KeyIdIndex keyId associatedFile
  deriving Show

CachedMetaField
  field MetaField
  FieldIndex field

CachedMetaData
  keyId CachedKeyId Eq
  fieldId CachedMetaFieldId Eq
  metaValue String

LastFscked
  keyId CachedKeyId Eq
  localFscked Int Maybe
</pre>
With this, running 1000 joins to get the associated files of 1000
Keys took 5.6s with warm cache. (When done in the same `runSqlite` call.) Ouch!

Update: This performance was fixed by adding `KeyIdOutdex keyId associatedFile`,
which adds a uniqueness constraint on the tuple of key and associatedFile.
With this, 1000 queries take 0.406s. Note that persistent is probably not
actually doing a join at the SQL level, so this could be sped up using,
eg, esqueleto.

Update2: Using esqueleto to do a join got this down to 0.109s.
See `database` branch for code.
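As a rough sketch, such an esqueleto join over the normalized schema looks
something like this (entity and field names assumed to be the ones persistent
generates from the schema above; the actual code is on the `database` branch):

<pre>
import Database.Esqueleto
import Database.Persist.Sqlite (SqlPersistM)

-- Look up a key's associated files with a real SQL-level inner join,
-- rather than two separate persistent queries.
getAssociatedFiles :: Key -> SqlPersistM [FilePath]
getAssociatedFiles k = do
    rs <- select $ from $ \(ck `InnerJoin` af) -> do
        on (ck ^. CachedKeyId ==. af ^. AssociatedFilesKeyId)
        where_ (ck ^. CachedKeyKey ==. val k)
        return (af ^. AssociatedFilesAssociatedFile)
    return (map unValue rs)
</pre>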
Update3: Converting to a single un-normalized table for AssociatedFiles
avoids the join, and brought the lookup time down to 0.087s. Of course, when
a key has multiple associated files, this will use more disk space, due
to not normalizing the key.

Compare the above with 1000 calls to `associatedFiles`, which is approximately
as fast as just opening and reading 1000 files, so will take well under
0.05s with a **cold** cache.
So, we're looking at maybe a 50% slowdown using sqlite and
persistent for associated files. OTOH, the normalized schema should
perform better when adding an associated file to a key that already has many.

For metadata, the story is much nicer. Querying for 30000 keys that all
have a particular tag in their metadata takes 0.65s. So, fast enough to be
used in views.
Update4: Comparing git-annex fsck using the sticky bit to the final sqlite
implementation:

* sticky bit: 4m30.787s
* sqlite: 4m40.789s