add

2014-03-12 18:05:22 -04:00 · 2014-03-12 18:05:22 -04:00 · 46aab35eb0
commit 46aab35eb0
parent 362d979298
1 changed files with 99 additions and 0 deletions
--- a/doc/design/caching_database.mdwn
+++ b/doc/design/caching_database.mdwn
@@ -0,0 +1,99 @@
+* [[metadata]] for views
+* [direct mode mappings scale badly with thousands of identical files](/bugs/__34__Adding_4923_files__34___is_really_slow)
+* [[bugs/incremental_fsck_should_not_use_sticky_bit]]
+
+What do all these have in common? They could all be improved by
+using some kind of database to locally store the information in an
+efficient way.
+
+The database should only function as a cache. It should be able to be
+generated and updated by looking at the git repository.
+
+* Metadata can be updated by looking at the git-annex branch,
+  either its current state, or the diff between the old and new versions
+* Direct mode mappings can be updated by looking at the current branch,
+  to see which files map to which key. Or the diff between the old
+  and new versions of the branch.
+* Incremental fsck information is not stored in git, but can be
+  "regenerated" by running fsck again.  
+  (Perhaps doesn't quite fit, but let it slide..)
+
+## case study: persistent with sqlite
+
+Here's a non-normalized database schema in persistent's syntax.
+
+<pre>
+CachedKey
+  key Key
+  associatedFiles [FilePath]
+  lastFscked Int Maybe
+  KeyIndex key
+
+CachedMetaData
+  key Key
+  metaDataField MetaDataField
+  metaDataValue MetaDataValue
+</pre>
+
+Using the above database schema and persistent with sqlite, I made
+a database containing 30k CachedKey records. This took 5 seconds to create
+and was 7 MB on disk. (It would be rather smaller if a more packed Key
+show/read instance were used.)
+
+Running 1000 separate queries to get 1000 CachedKeys took 0.688s with warm
+cache. This was more than halved when all 1000 queries were done inside the
+same `runSqlite` call. (Which could be done using a separate thread and some
+MVars.)
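+
+A minimal sketch of that thread-and-MVars idea, using only `base`. Everything
+here is an assumption made for illustration: `withSession` stands in for a
+single long-lived `runSqlite` call, `queryFiles` stands in for a real query,
+and the `Session` type is just an in-memory list, not persistent's API.
+
+```haskell
+import Control.Concurrent (forkIO)
+import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
+import Control.Monad (forever)
+
+-- Hypothetical stand-ins for the real types.
+type Key = String
+type Session = [(Key, [FilePath])]
+
+-- A request pairs the key to look up with an MVar for the reply.
+data Request = Request Key (MVar (Maybe [FilePath]))
+
+-- Hypothetical stand-in for opening the database once.
+withSession :: (Session -> IO a) -> IO a
+withSession act = act [("key1", ["a", "b"]), ("key2", ["c"])]
+
+-- Hypothetical query run inside the already-open session.
+queryFiles :: Session -> Key -> IO (Maybe [FilePath])
+queryFiles s k = return (lookup k s)
+
+-- Start a worker that opens the session once, then serves requests forever.
+startWorker :: IO (MVar Request)
+startWorker = do
+  reqs <- newEmptyMVar
+  _ <- forkIO $ withSession $ \s -> forever $ do
+    Request k reply <- takeMVar reqs
+    queryFiles s k >>= putMVar reply
+  return reqs
+
+-- Send one query to the worker and block until it answers.
+cachedFiles :: MVar Request -> Key -> IO (Maybe [FilePath])
+cachedFiles reqs k = do
+  reply <- newEmptyMVar
+  putMVar reqs (Request k reply)
+  takeMVar reply
+
+main :: IO ()
+main = do
+  reqs <- startWorker
+  mapM (cachedFiles reqs) ["key1", "key2", "missing"] >>= print
+```
+
+Callers only ever see `cachedFiles`, so the cost of opening the database is
+paid once by the worker rather than once per query.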
+
+(Note that if the database is a cache, there is no need to perform migrations
+when querying it. My benchmarks skip `runMigration`. Instead, if a query
+fails, then either the database doesn't exist or it uses an incompatible
+schema, and the cache can be rebuilt at that point. This avoids the problem
+that persistent's migrations can sometimes fail.)
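+
+The query-then-rebuild approach could look roughly like the following sketch.
+It uses only `base`; `queryCache`, `rebuildCache` and the `Cache` type are
+hypothetical stand-ins (an `IORef` plays the role of the database file), and
+catching `SomeException` is a simplification of whatever error a failed
+query would really raise.
+
+```haskell
+import Control.Exception (SomeException, try)
+import Data.IORef (IORef, newIORef, readIORef, writeIORef)
+
+-- Hypothetical cache state: Nothing models a missing or incompatible database.
+type Cache = IORef (Maybe [(String, [FilePath])])
+
+-- Hypothetical query; throws when the cache is unusable.
+queryCache :: Cache -> String -> IO (Maybe [FilePath])
+queryCache c k = do
+  db <- readIORef c
+  case db of
+    Nothing -> ioError (userError "no usable cache database")
+    Just rows -> return (lookup k rows)
+
+-- Hypothetical rebuild; the real one would rescan the git repository.
+rebuildCache :: Cache -> IO ()
+rebuildCache c = writeIORef c (Just [("key1", ["a", "b"])])
+
+-- Query with no migration step; on failure rebuild once and retry.
+queryOrRebuild :: Cache -> String -> IO (Maybe [FilePath])
+queryOrRebuild c k = do
+  r <- try (queryCache c k) :: IO (Either SomeException (Maybe [FilePath]))
+  case r of
+    Right v -> return v
+    Left _ -> rebuildCache c >> queryCache c k
+
+main :: IO ()
+main = do
+  c <- newIORef Nothing
+  queryOrRebuild c "key1" >>= print
+```
+
+The retry happens only once, so a rebuild that itself produces an unusable
+cache still surfaces as an error instead of looping.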
+
+Doubling the db to 60k scaled linearly in disk and cpu and did not affect
+query time.
+
+----
+
+Here's a normalized schema:
+
+<pre>
+CachedKey
+  key Key
+  KeyIndex key
+  deriving Show
+
+AssociatedFiles
+  keyId CachedKeyId Eq
+  associatedFile FilePath
+  deriving Show
+
+CachedMetaField
+  field MetaField
+  FieldIndex field
+
+CachedMetaData
+  keyId CachedKeyId Eq
+  fieldId CachedMetaFieldId Eq
+  metaValue String
+
+LastFscked
+  keyId CachedKeyId Eq
+  lastFscked Int Maybe
+</pre>
+
+With this, running 1000 joins to get the associated files of 1000
+Keys took 5.6s with warm cache. (When done in the same `runSqlite` call.) Ouch!
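+
+For reference, here is what each of those joins has to compute, written as a
+pure in-memory model rather than SQL. The record types and `Int` ids are
+assumptions that mirror the `CachedKey` and `AssociatedFiles` tables above;
+this is not the code that was benchmarked.
+
+```haskell
+-- Rows of the normalized tables, with hypothetical Int ids.
+data CachedKey = CachedKey { keyId :: Int, key :: String }
+data AssociatedFile = AssociatedFile { fileKeyId :: Int, associatedFile :: FilePath }
+
+-- The join: find the row id for the key, then collect the files referencing it.
+associatedFilesFor :: [CachedKey] -> [AssociatedFile] -> String -> [FilePath]
+associatedFilesFor keys files k =
+  [ associatedFile f
+  | ck <- keys, key ck == k
+  , f <- files, fileKeyId f == keyId ck
+  ]
+
+main :: IO ()
+main = do
+  let keys = [CachedKey 1 "key1", CachedKey 2 "key2"]
+      files = [AssociatedFile 1 "a", AssociatedFile 1 "b", AssociatedFile 2 "c"]
+  print (associatedFilesFor keys files "key1")
+```
+
+In the non-normalized schema the same answer is a single field read from one
+`CachedKey` row, with no second table to consult.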
+
+Compare the above with 1000 calls to `associatedFiles`, which is approximately
+as fast as just opening and reading 1000 files, so will take well under
+0.05s with a **cold** cache.
+
+So, we're looking at nearly an order of magnitude slowdown using sqlite and
+persistent for associated files. OTOH, the normalized schema should
+perform better when adding an associated file to a key that already has many.
+
+For metadata, the story is much nicer. Querying for 30000 keys that all
+have a particular tag in their metadata takes 0.65s. So fast enough to be
+used in views.