* [[metadata]] for views
* [direct mode mappings scale badly with thousands of identical files](/bugs/__34__Adding_4923_files__34___is_really_slow)
* [[bugs/incremental_fsck_should_not_use_sticky_bit]]

What do all these have in common? They could all be improved by
using some kind of database to locally store the information in an
efficient way.

The database should only function as a cache. It should be possible to
generate and update it by looking at the git repository.

* Metadata can be updated by looking at the git-annex branch,
  either its current state, or the diff between the old and new versions.
* Direct mode mappings can be updated by looking at the current branch,
  to see which files map to which key. Or the diff between the old
  and new versions of the branch.
* Incremental fsck information is not stored in git, but can be
  "regenerated" by running fsck again.
  (Perhaps doesn't quite fit, but let it slide..)

Store in the database the Ref of the branch that was used to construct it.
(Update in same transaction as cached data.)

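A minimal sketch of keeping the branch Ref in the same transaction as the cached data. It uses Python's stdlib `sqlite3` rather than persistent, so it only illustrates the SQLite side. The table names (`cached_metadata`, `cache_state`) and function names are made up for the example.

```python
import sqlite3

def open_cache(path=":memory:"):
    """Open the cache db and create the (hypothetical) tables if missing."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS cached_metadata "
               "(key TEXT, field TEXT, value TEXT)")
    # Single-row table holding the Ref the cache was built from.
    db.execute("CREATE TABLE IF NOT EXISTS cache_state "
               "(id INTEGER PRIMARY KEY CHECK (id = 1), branch_ref TEXT NOT NULL)")
    db.commit()
    return db

def cached_ref(db):
    """Return the Ref the cache currently reflects, or None if never built."""
    row = db.execute("SELECT branch_ref FROM cache_state WHERE id = 1").fetchone()
    return row[0] if row else None

def update_cache(db, new_ref, rows):
    """Add cached rows and record new_ref in one transaction."""
    with db:  # commits on success, rolls back if anything raises
        db.executemany("INSERT INTO cached_metadata VALUES (?, ?, ?)", rows)
        db.execute("INSERT OR REPLACE INTO cache_state (id, branch_ref) "
                   "VALUES (1, ?)", (new_ref,))
```

If an update fails partway through, both the rows and the Ref roll back together, so the stored Ref should never claim more than the cache actually contains.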
## implementation plan

1. Implement for metadata, on a branch, with sqlite.
2. Make sure that builds on all platforms.
3. Add associated file mappings support. This is needed to fully
   use the caching database to construct views.
4. Store incremental fsck info in db.
5. For direct mode, replace .map files with the mappings from step 3.

## case study: persistent with sqlite

Here's a non-normalized database schema in persistent's syntax.

<pre>
CachedKey
  key Key
  associatedFiles [FilePath]
  lastFscked Int Maybe
  KeyIndex key

CachedMetaData
  key Key
  metaDataField MetaDataField
  metaDataValue MetaDataValue
</pre>
Using the above database schema and persistent with sqlite, I made
a database containing 30k Cache records. This took 5 seconds to create
and was 7 MB on disk. (It would be rather smaller if a more packed Key
show/read instance were used.)

Running 1000 separate queries to get 1000 CachedKeys took 0.688s with warm
cache. This was more than halved when all 1000 queries were done inside the
same `runSqlite` call. (Which could be done using a separate thread and some
MVars.)

(Note that if the database is a cache, there is no need to perform migrations
when querying it. My benchmarks skip `runMigration`. Instead, if the query
fails, that means the database either doesn't exist or uses an incompatible
schema, and the cache can be rebuilt at that point. This avoids the problem
that persistent's migrations can sometimes fail.)

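The rebuild-on-failure idea can be sketched roughly like this, again with Python's stdlib `sqlite3` instead of persistent. The `cached_key` table, the `rebuild` helper, and passing the keys in as a list (standing in for regenerating from the git repository) are all assumptions made for the example.

```python
import sqlite3

SCHEMA = "CREATE TABLE cached_key (key TEXT PRIMARY KEY, last_fscked INTEGER)"

def rebuild(db, keys):
    """Throw the cache away and regenerate it. Here the keys come from a
    plain list; a real cache would be regenerated from the repository."""
    with db:
        db.execute("DROP TABLE IF EXISTS cached_key")
        db.execute(SCHEMA)
        db.executemany("INSERT INTO cached_key (key) VALUES (?)",
                       [(k,) for k in keys])

def last_fscked(db, key, all_keys):
    """Query without migrating first; rebuild the cache only if that fails."""
    query = "SELECT last_fscked FROM cached_key WHERE key = ?"
    try:
        row = db.execute(query, (key,)).fetchone()
    except sqlite3.OperationalError:
        # Missing table or incompatible schema: rebuild, then retry once.
        rebuild(db, all_keys)
        row = db.execute(query, (key,)).fetchone()
    return row[0] if row else None
```

The common path pays nothing for schema checks; only a failed query triggers the rebuild.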
Doubling the db to 60k scaled linearly in disk and cpu and did not affect
query time.

----

Here's a normalized schema:

<pre>
CachedKey
  key Key
  KeyIndex key
  deriving Show

AssociatedFiles
  keyId CachedKeyId Eq
  associatedFile FilePath
  KeyIdIndex keyId associatedFile
  deriving Show

CachedMetaField
  field MetaField
  FieldIndex field

CachedMetaData
  keyId CachedKeyId Eq
  fieldId CachedMetaFieldId Eq
  metaValue String

LastFscked
  keyId CachedKeyId Eq
  localFscked Int Maybe
</pre>
With this, running 1000 joins to get the associated files of 1000
Keys took 5.6s with warm cache. (When done in the same `runSqlite` call.) Ouch!

Update: This performance was fixed by adding `KeyIdIndex keyId associatedFile`,
which adds a uniqueness constraint on the tuple of key and associatedFile.
With this, 1000 queries take 0.406s. Note that persistent is probably not
actually doing a join at the SQL level, so this could be sped up using
eg, esqueleto.

Update2: Using esqueleto to do a join got this down to 0.250s.

Code: <http://lpaste.net/101141> <http://lpaste.net/101142>

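For illustration, here is roughly what the uniqueness constraint and an SQL-level join look like in plain SQLite, driven from Python's stdlib `sqlite3`. This is not the benchmarked code from the links above; the table, column, and function names are assumptions loosely following the normalized schema.

```python
import sqlite3

def open_files_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE cached_key (id INTEGER PRIMARY KEY, key TEXT NOT NULL UNIQUE);
    CREATE TABLE associated_files (
        key_id INTEGER NOT NULL REFERENCES cached_key(id),
        associated_file TEXT NOT NULL,
        UNIQUE (key_id, associated_file)  -- SQLite backs this with an index
    );
    """)
    return db

def add_file(db, key, path):
    """Record that path is associated with key; duplicates are ignored."""
    with db:
        db.execute("INSERT OR IGNORE INTO cached_key (key) VALUES (?)", (key,))
        db.execute(
            "INSERT OR IGNORE INTO associated_files (key_id, associated_file) "
            "SELECT id, ? FROM cached_key WHERE key = ?", (path, key))

def associated_files(db, key):
    """Fetch a key's files with a single join at the SQL level."""
    rows = db.execute(
        "SELECT f.associated_file FROM associated_files f "
        "JOIN cached_key k ON k.id = f.key_id "
        "WHERE k.key = ? ORDER BY f.associated_file", (key,))
    return [r[0] for r in rows]
```

The `UNIQUE` constraint gives SQLite an index whose leading column is `key_id`, so the join does not have to scan the whole `associated_files` table.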
Compare the above with 1000 calls to `associatedFiles`, which is approximately
as fast as just opening and reading 1000 files, so will take well under
0.05s with a **cold** cache.

So, we're looking at nearly an order of magnitude slowdown using sqlite and
persistent for associated files. OTOH, the normalized schema should
perform better when adding an associated file to a key that already has many.

For metadata, the story is much nicer. Querying for 30000 keys that all
have a particular tag in their metadata takes 0.65s. So fast enough to be
used in views.

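A rough sketch of the kind of metadata query a view would need, against tables modelled on the normalized schema. It uses Python's stdlib `sqlite3`, not persistent. The table and column names are assumptions, and the `meta_lookup` index is an addition for the example that is not part of the schema above.

```python
import sqlite3

def open_meta_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE cached_key (id INTEGER PRIMARY KEY, key TEXT NOT NULL UNIQUE);
    CREATE TABLE cached_meta_field (id INTEGER PRIMARY KEY, field TEXT NOT NULL UNIQUE);
    CREATE TABLE cached_meta_data (
        key_id INTEGER NOT NULL REFERENCES cached_key(id),
        field_id INTEGER NOT NULL REFERENCES cached_meta_field(id),
        meta_value TEXT NOT NULL
    );
    -- Assumed index, so lookups by field and value avoid a full scan.
    CREATE INDEX meta_lookup ON cached_meta_data (field_id, meta_value);
    """)
    return db

def keys_with(db, field, value):
    """Return all keys whose metadata has field set to value."""
    rows = db.execute("""
        SELECT k.key FROM cached_meta_data m
        JOIN cached_meta_field f ON f.id = m.field_id
        JOIN cached_key k ON k.id = m.key_id
        WHERE f.field = ? AND m.meta_value = ?
        ORDER BY k.key""", (field, value))
    return [r[0] for r in rows]
```

For example, `keys_with(db, "tag", "work")` would list every key tagged `work`.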