diff --git a/doc/design/caching_database.mdwn b/doc/design/caching_database.mdwn
new file mode 100644
index 0000000000..00a65d4b1c
--- /dev/null
+++ b/doc/design/caching_database.mdwn
@@ -0,0 +1,99 @@
* [[metadata]] for views
* [direct mode mappings scale badly with thousands of identical files](/bugs/__34__Adding_4923_files__34___is_really_slow)
* [[bugs/incremental_fsck_should_not_use_sticky_bit]]

What do all these have in common? They could all be improved by
using some kind of database to locally store the information in an
efficient way.

The database should only function as a cache. It should be possible to
generate and update it by looking at the git repository:

* Metadata can be updated by looking at the git-annex branch, either its
  current state, or the diff between the old and new versions.
* Direct mode mappings can be updated by looking at the current branch,
  to see which files map to which key, or at the diff between the old
  and new versions of the branch.
* Incremental fsck information is not stored in git, but can be
  "regenerated" by running fsck again.
  (Perhaps doesn't quite fit, but let it slide..)
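To make the cache-only property concrete, here is a minimal,
hypothetical sketch of the update contract. `CacheState`,
`updateFromDiff`, and `rebuildFromBranch` are invented names for this
sketch, standing in for code that would actually walk the git-annex
branch; none of this is existing git-annex code:

    -- Hypothetical sketch: the cache records which git-annex branch
    -- commit it was last built from, and is updated incrementally from
    -- a diff when possible, or else fully regenerated.
    type Sha = String

    data CacheState
        = UpToDate          -- cache already matches the branch
        | StaleFrom Sha     -- cache was built from this older commit
        | Missing           -- no usable cache (absent, or incompatible schema)

    updateCache :: CacheState -> IO ()
    updateCache UpToDate        = return ()
    updateCache (StaleFrom old) = updateFromDiff old  -- look only at what changed
    updateCache Missing         = rebuildFromBranch   -- full scan of the branch

    -- Placeholder stubs; real code would run git and write to the database.
    updateFromDiff :: Sha -> IO ()
    updateFromDiff old = putStrLn ("incremental update since " ++ old)

    rebuildFromBranch :: IO ()
    rebuildFromBranch = putStrLn "full rebuild"

    main :: IO ()
    main = updateCache Missing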
## case study: persistent with sqlite

Here's a non-normalized database schema in persistent's syntax:

    CachedKey
      key Key
      associatedFiles [FilePath]
      lastFscked Int Maybe
      KeyIndex key

    CachedMetaData
      key Key
      metaDataField MetaDataField
      metaDataValue MetaDataValue

Using the above database schema and persistent with sqlite, I made a
database containing 30k CachedKey records. This took 5 seconds to create
and was 7 MB on disk. (It would be rather smaller if a more packed Key
show/read instance were used.)

Running 1000 separate queries to get 1000 CachedKeys took 0.688s with a
warm cache. This was more than halved when all 1000 queries were done
inside the same `runSqlite` call. (Which could be done using a separate
thread and some MVars.)

(Note that since the database is only a cache, there is no need to
perform migrations when querying it. My benchmarks skip `runMigration`.
Instead, if a query fails, the database either doesn't exist or uses an
incompatible schema, and the cache can be rebuilt then. This avoids the
problem that persistent's migrations can sometimes fail.)

Doubling the db to 60k records scaled linearly in disk use and CPU time,
and did not affect query time.
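To illustrate the single-`runSqlite`-call pattern mentioned above, here
is a self-contained sketch. The schema is trimmed and Key is simplified
to Text (git-annex's real Key type would need its own PersistField
instance); the database filename and demo key format are invented for
the example:

    {-# LANGUAGE EmptyDataDecls, FlexibleContexts, GADTs,
                 GeneralizedNewtypeDeriving, MultiParamTypeClasses,
                 OverloadedStrings, QuasiQuotes, TemplateHaskell,
                 TypeFamilies #-}
    import Control.Monad (forM_, void)
    import Control.Monad.IO.Class (liftIO)
    import Data.Maybe (catMaybes)
    import Data.Text (Text, pack)
    import Database.Persist
    import Database.Persist.Sqlite
    import Database.Persist.TH

    -- Trimmed version of the schema above, with Key simplified to Text.
    share [mkPersist sqlSettings, mkMigrate "migrateAll"] [persistLowerCase|
    CachedKey
        key Text
        lastFscked Int Maybe
        KeyIndex key
    |]

    main :: IO ()
    main = runSqlite "cache.db" $ do
        -- Migration runs here only so the demo sets itself up; per the
        -- note above, the real cache would skip it and rebuild on failure.
        runMigration migrateAll
        forM_ [1 :: Int .. 1000] $ \n ->
            void $ insertUnique $ CachedKey (demoKey n) Nothing
        -- All 1000 lookups share one connection, instead of paying
        -- runSqlite's setup cost once per query.
        rs <- mapM (getBy . KeyIndex . demoKey) [1 :: Int .. 1000]
        liftIO $ putStrLn ("found " ++ show (length (catMaybes rs)) ++ " of 1000")
      where
        demoKey :: Int -> Text
        demoKey n = pack ("SHA256--demo" ++ show n)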
----

Here's a normalized schema:

    CachedKey
      key Key
      KeyIndex key
      deriving Show

    AssociatedFiles
      keyId CachedKeyId Eq
      associatedFile FilePath
      deriving Show

    CachedMetaField
      field MetaField
      FieldIndex field

    CachedMetaData
      keyId CachedKeyId Eq
      fieldId CachedMetaFieldId Eq
      metaValue String

    LastFscked
      keyId CachedKeyId Eq
      localFscked Int Maybe

With this, running 1000 joins to get the associated files of 1000 Keys
took 5.6s with a warm cache (when done in the same `runSqlite` call).
Ouch!

Compare that with 1000 calls to the existing `associatedFiles`, which is
approximately as fast as just opening and reading 1000 files, and so
takes well under 0.05s even with a **cold** cache.

So, for associated files, sqlite and persistent are nearly an order of
magnitude slower than the non-normalized schema, and roughly two orders
of magnitude slower than the flat files. OTOH, the normalized schema
should perform better when adding an associated file to a key that
already has many.

For metadata, the story is much nicer. Querying for 30000 keys that all
have a particular tag in their metadata takes 0.65s. That is fast enough
to be used in views.
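To show the shape of such a metadata query, here is a hypothetical,
self-contained sketch against a trimmed version of the normalized
schema. It uses two plain persistent selects rather than the SQL join
benchmarked above; Key and MetaField are simplified to Text, and storing
tags under a "tag" field is an assumption made only for this sketch:

    {-# LANGUAGE EmptyDataDecls, FlexibleContexts, GADTs,
                 GeneralizedNewtypeDeriving, MultiParamTypeClasses,
                 OverloadedStrings, QuasiQuotes, TemplateHaskell,
                 TypeFamilies #-}
    import Control.Monad.IO.Class (liftIO)
    import Data.Maybe (catMaybes)
    import Data.Text (Text)
    import Database.Persist
    import Database.Persist.Sqlite
    import Database.Persist.TH

    -- Trimmed normalized schema; Key and MetaField simplified to Text.
    share [mkPersist sqlSettings, mkMigrate "migrateAll"] [persistLowerCase|
    CachedKey
        key Text
        KeyIndex key

    CachedMetaField
        field Text
        FieldIndex field

    CachedMetaData
        keyId CachedKeyId
        fieldId CachedMetaFieldId
        metaValue Text
    |]

    -- All keys whose "tag" metadata field (an assumed convention) has
    -- the given value.
    keysForTag :: Text -> SqlPersistM [Text]
    keysForTag tag = do
        mfield <- getBy (FieldIndex "tag")
        case mfield of
            Nothing -> return []
            Just (Entity fid _) -> do
                rows <- selectList
                    [ CachedMetaDataFieldId ==. fid
                    , CachedMetaDataMetaValue ==. tag ] []
                -- one extra get per match, to turn each CachedKeyId
                -- back into its key
                ks <- mapM (get . cachedMetaDataKeyId . entityVal) rows
                return (map cachedKeyKey (catMaybes ks))

    main :: IO ()
    main = runSqlite "cache.db" $ do
        runMigration migrateAll  -- demo setup only; see migration note above
        liftIO . print =<< keysForTag "important"

A real implementation would likely express this as a single join, as
benchmarked above; the two-select form just makes the schema
relationships explicit.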