* [[metadata]] for views
* [direct mode mappings scale badly with thousands of identical files](/bugs/__34__Adding_4923_files__34___is_really_slow)
* [[bugs/incremental_fsck_should_not_use_sticky_bit]]
* [[todo/wishlist:_pack_metadata_in_direct_mode]]
* [[todo/cache_key_info]]

What do all these have in common? They could all be improved by
using some kind of database to locally store the information in an
efficient way.

The database should only function as a cache. It should be able to be
generated and updated by looking at the git repository.

* Metadata can be updated by looking at the git-annex branch,
  either its current state, or the diff between the old and new versions.
* Direct mode mappings can be updated by looking at the current branch,
  to see which files map to which key. Or the diff between the old
  and new versions of the branch.
* Incremental fsck information is not stored in git, but can be
  "regenerated" by running fsck again.
  (Perhaps doesn't quite fit, but let it slide..)

Store in the database the Ref of the branch that was used to construct it.
(Update in same transaction as cached data.)

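The shape of that with persistent might be roughly as follows. This is only a
sketch: the entity names (`BuiltFrom`, `CachedItem`) are invented for
illustration, and the usual persistent boilerplate (entity declarations,
pragmas, imports) is omitted. The relevant point is that everything done
inside a single `runSqlite` call is committed as one transaction.

<pre>
-- Invented names; the point is only that the branch Ref and the cached
-- data are written inside the same runSqlite call, ie one transaction.
updateCache :: String -> [(String, String)] -> IO ()
updateCache branchRef items = runSqlite "cache.db" $ do
    deleteWhere ([] :: [Filter BuiltFrom])   -- drop the previously stored Ref
    insert_ (BuiltFrom branchRef)            -- record the Ref just used
    forM_ items $ \(k, v) -> insert_ (CachedItem k v)
</pre>
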
## implementation plan

1. Store incremental fsck info in db, on a branch, with sqlite. **done**
2. Make sure that builds on all platforms, and works reliably. **done**
3. Use sqlite db for associated files cache, and use in direct mode.
4. Also, use associated files db to construct views.
5. Use sqlite db for metadata cache.
6. Use sqlite db for list of keys present in local annex.

## sqlite or not?

sqlite seems like the most likely thing to work. But it does involve, ugh,
SQL. And even if that's hidden by a layer like persistent, it's still going
to involve some technical debt (eg, database migrations).

It would be great if there were some haskell thing like acid-state
that I could use instead. But, acid-state needs to load the whole
DB into memory. In the comments of
[[bugs/incremental_fsck_should_not_use_sticky_bit]] I examined several
other haskell database-like things, and found them all wanting, except for
possibly TCache. (And TCache is backed by persistent/sqlite anyway.)

## one db or multiple?

Using a single database will use less space. Eg, each Key will only need to
appear in it once, with proper normalization.

OTOH, it's more complicated, and harder to recover from problems.

Currently leaning toward one database per purpose.

## case study: persistent with sqlite

Here's a non-normalized database schema in persistent's syntax.

<pre>
CachedKey
  key Key
  associatedFiles [FilePath]
  lastFscked Int Maybe
  KeyIndex key

CachedMetaData
  key Key
  metaDataField MetaDataField
  metaDataValue MetaDataValue
</pre>

Using the above database schema and persistent with sqlite, I made
a database containing 30k CachedKey records. This took 5 seconds to create
and was 7 MB on disk. (Would be rather smaller, if a more packed Key
show/read instance were used.)

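For reference, a rough sketch of what such a benchmark could look like. This
is not the actual benchmark code; it uses the schema above, with `Key` and the
metadata types simplified to their `String` serializations so no extra
`PersistField` instances are needed.

<pre>
{-# LANGUAGE EmptyDataDecls, FlexibleContexts, GADTs,
             GeneralizedNewtypeDeriving, MultiParamTypeClasses,
             OverloadedStrings, QuasiQuotes, TemplateHaskell,
             TypeFamilies #-}
import Control.Monad
import Database.Persist
import Database.Persist.Sqlite
import Database.Persist.TH

share [mkPersist sqlSettings, mkMigrate "migrateAll"] [persistLowerCase|
CachedKey
  key String
  associatedFiles [FilePath]
  lastFscked Int Maybe
  KeyIndex key

CachedMetaData
  key String
  metaDataField String
  metaDataValue String
|]

-- Populate the cache with 30k dummy CachedKey records.
main :: IO ()
main = runSqlite "cache.db" $ do
    runMigration migrateAll
    forM_ [1 .. 30000 :: Int] $ \n ->
        insert_ $ CachedKey ("SHA256--" ++ show n) ["file" ++ show n] Nothing
</pre>
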
Running 1000 separate queries to get 1000 CachedKeys took 0.688s with warm
cache. This was more than halved when all 1000 queries were done inside the
same `runSqlite` call. (Which could be done using a separate thread and some
MVars.)

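To make that concrete, here is one way the separate-thread approach could
look. This is only a sketch, reusing the declarations from the previous
snippet; `Request`, `dbWorker`, and `lookupKey` are invented names. A
long-lived thread owns the single `runSqlite` call, and callers hand it
requests over MVars.

<pre>
import Control.Concurrent
import Control.Monad.IO.Class (liftIO)

-- A request pairs a key with an MVar to put the reply in.
data Request = Request String (MVar (Maybe CachedKey))

-- One thread holds the database open; every query runs inside its
-- single runSqlite call.
dbWorker :: MVar Request -> IO ()
dbWorker reqv = runSqlite "cache.db" $ forever $ do
    Request k replyv <- liftIO $ takeMVar reqv
    r <- getBy (KeyIndex k)
    liftIO $ putMVar replyv (fmap entityVal r)

-- Callers post a request and block until the worker replies.
lookupKey :: MVar Request -> String -> IO (Maybe CachedKey)
lookupKey reqv k = do
    replyv <- newEmptyMVar
    putMVar reqv (Request k replyv)
    takeMVar replyv
</pre>

The worker would be started once, eg with `forkIO (dbWorker reqv)`, and then
any number of `lookupKey` calls share its open database connection.
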
(Note that if the database is a cache, there is no need to perform migrations
when querying it. My benchmarks skip `runMigration`. Instead, if the query
fails, the database either doesn't exist or uses an incompatible schema, and
the cache can be rebuilt then. This avoids the problem that persistent's
migrations can sometimes fail.)

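A sketch of that failure-handling shape, again reusing the declarations from
the earlier snippet; `rebuildCache` here is only a stand-in for regenerating
the db from the git repository:

<pre>
import Control.Exception (SomeException, try)

-- Query without running migrations; if anything goes wrong (no database
-- yet, incompatible schema, ...), rebuild the cache and query again.
cachedKeys :: IO [CachedKey]
cachedKeys = do
    r <- try go :: IO (Either SomeException [CachedKey])
    case r of
        Right ks -> return ks
        Left _ -> rebuildCache >> go
  where
    go :: IO [CachedKey]
    go = runSqlite "cache.db" $
        fmap (map entityVal) (selectList [] [])

-- Stand-in for regenerating the database from the git repository.
rebuildCache :: IO ()
rebuildCache = error "rebuild the cache from the git repository here"
</pre>
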
Doubling the db to 60k scaled linearly in disk and cpu and did not affect
query time.

----

Here's a normalized schema:

<pre>
CachedKey
  key Key
  KeyIndex key
  deriving Show

AssociatedFiles
  keyId CachedKeyId Eq
  associatedFile FilePath
  KeyIdIndex keyId associatedFile
  deriving Show

CachedMetaField
  field MetaField
  FieldIndex field

CachedMetaData
  keyId CachedKeyId Eq
  fieldId CachedMetaFieldId Eq
  metaValue String

LastFscked
  keyId CachedKeyId Eq
  localFscked Int Maybe
</pre>

With this, running 1000 joins to get the associated files of 1000
Keys took 5.6s with warm cache. (When done in the same `runSqlite` call.) Ouch!

Update: This performance was fixed by adding `KeyIdOutdex keyId associatedFile`,
which adds a uniqueness constraint on the tuple of key and associatedFile.
With this, 1000 queries take 0.406s. Note that persistent is probably not
actually doing a join at the SQL level, so this could be sped up using
eg, esqueleto.

Update2: Using esqueleto to do a join got this down to 0.109s.
See `database` branch for code.

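The real code is on the `database` branch; for flavor, the join would look
something like this sketch using esqueleto's syntax. It assumes the normalized
schema above has been fed through persistent's Template Haskell (so generated
names like `AssociatedFilesKeyId` exist) and that `Key` has a `PersistField`
instance.

<pre>
import Database.Esqueleto

-- Look up the associated files of one Key via an SQL join.
associatedFilesOf :: Key -> SqlPersistM [FilePath]
associatedFilesOf k = do
    rows <- select $ from $ \(ck `InnerJoin` af) -> do
        on (ck ^. CachedKeyId ==. af ^. AssociatedFilesKeyId)
        where_ (ck ^. CachedKeyKey ==. val k)
        return (af ^. AssociatedFilesAssociatedFile)
    return (map unValue rows)
</pre>
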
Update3: Converting to a single un-normalized table for AssociatedFiles
avoids the join, and brought lookup time down to 0.087s. Of course, when
a key has multiple associated files, this will use more disk space, due
to not normalizing the key.

Compare the above with 1000 calls to `associatedFiles`, which is approximately
as fast as just opening and reading 1000 files, so will take well under
0.05s with a **cold** cache.

So, we're looking at maybe 50% slowdown using sqlite and
persistent for associated files. OTOH, the normalized schema should
perform better when adding an associated file to a key that already has many.

For metadata, the story is much nicer. Querying for 30000 keys that all
have a particular tag in their metadata takes 0.65s. So fast enough to be
used in views.

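With the un-normalized `CachedMetaData` table from the first schema, that kind
of view query is just a filtered select; something along these lines (a sketch
only, using the simplified `String` declarations from the earlier snippet):

<pre>
-- Keys whose metadata contains, eg, tag=favorite.
keysWithTag :: String -> SqlPersistM [String]
keysWithTag tagval = do
    ms <- selectList
        [ CachedMetaDataMetaDataField ==. "tag"
        , CachedMetaDataMetaDataValue ==. tagval
        ] []
    return (map (cachedMetaDataKey . entityVal) ms)
</pre>
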
Update4: Comparing git-annex fsck using the sticky bit to the final sqlite
implementation:

<pre>
sticky bit: 4m30.787s
sqlite:     4m40.789s
</pre>