add
This commit is contained in:
parent
362d979298
commit
46aab35eb0
1 changed files with 99 additions and 0 deletions
99
doc/design/caching_database.mdwn
Normal file
@ -0,0 +1,99 @@
* [[metadata]] for views
* [direct mode mappings scale badly with thousands of identical files](/bugs/__34__Adding_4923_files__34___is_really_slow)
* [[bugs/incremental_fsck_should_not_use_sticky_bit]]

What do all these have in common? They could all be improved by
using some kind of database to locally store the information in an
efficient way.

The database should function only as a cache. It should be possible to
generate and update it just by looking at the git repository.

* Metadata can be updated by looking at the git-annex branch,
  either its current state, or the diff between the old and new versions.
* Direct mode mappings can be updated by looking at the current branch,
  to see which files map to which key, or at the diff between the old
  and new versions of the branch.
* Incremental fsck information is not stored in git, but can be
  "regenerated" by running fsck again.
  (Perhaps doesn't quite fit, but let it slide..)

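The diff-driven update of the direct mode mappings could look roughly like this. It is only a sketch, and it uses Python's `sqlite3` module as a stand-in for persistent. The table name `associated_files`, the helper names, and the `(file, old_key, new_key)` diff format are all made up for illustration; how the diff is actually obtained from git is not shown.

```python
import sqlite3

def open_cache(path=":memory:"):
    """Open the cache, creating the (hypothetical) mapping table if missing."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS associated_files ("
        " key TEXT NOT NULL, file TEXT NOT NULL, PRIMARY KEY (key, file))"
    )
    return db

def apply_diff(db, diff):
    """Apply (file, old_key, new_key) entries; None means absent on that side."""
    with db:  # one transaction for the whole diff
        for file, old_key, new_key in diff:
            if old_key is not None:
                db.execute(
                    "DELETE FROM associated_files WHERE key = ? AND file = ?",
                    (old_key, file),
                )
            if new_key is not None:
                db.execute(
                    "INSERT OR IGNORE INTO associated_files (key, file)"
                    " VALUES (?, ?)",
                    (new_key, file),
                )

def files_for(db, key):
    """Return the files currently mapped to a key, sorted by name."""
    rows = db.execute(
        "SELECT file FROM associated_files WHERE key = ? ORDER BY file", (key,)
    )
    return [file for (file,) in rows]

# Usage: two files added with the same key, then one of them changes content.
db = open_cache()
apply_diff(db, [("a.txt", None, "K1"), ("b.txt", None, "K1")])
apply_diff(db, [("b.txt", "K1", "K2")])
```

Because the whole diff is applied in one transaction, an interrupted update leaves the cache at the old version, and the same diff can simply be applied again.
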
## case study: persistent with sqlite

Here's a non-normalized database schema in persistent's syntax.

<pre>
CachedKey
  key Key
  associatedFiles [FilePath]
  lastFscked Int Maybe
  KeyIndex key

CachedMetaData
  key Key
  metaDataField MetaDataField
  metaDataValue MetaDataValue
</pre>

Using the above database schema and persistent with sqlite, I made
a database containing 30k Cache records. This took 5 seconds to create
and was 7 MB on disk. (It would be rather smaller if a more compact Key
show/read instance were used.)

Running 1000 separate queries to get 1000 CachedKeys took 0.688s with a warm
cache. This was more than halved when all 1000 queries were done inside the
same `runSqlite` call. (Which could be done using a separate thread and some
MVars.)

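To illustrate the difference between the two access patterns, here is a sketch using Python's `sqlite3` module rather than persistent. Opening a connection per query stands in for one `runSqlite` call per query, and a single long-lived connection stands in for one shared `runSqlite` call. The table layout and function names are assumptions, and no timings are claimed for this code; it only shows that both patterns return the same rows.

```python
import os
import sqlite3
import tempfile

QUERY = "SELECT last_fscked FROM cached_key WHERE key = ?"

def build_cache(path, n):
    """Create a small on-disk cache with n hypothetical key records."""
    db = sqlite3.connect(path)
    with db:
        db.execute(
            "CREATE TABLE cached_key (key TEXT PRIMARY KEY, last_fscked INTEGER)"
        )
        db.executemany(
            "INSERT INTO cached_key (key, last_fscked) VALUES (?, ?)",
            ((f"KEY-{i}", i) for i in range(n)),
        )
    db.close()

def lookup_separately(path, keys):
    """Open and close the database for every single query."""
    results = []
    for key in keys:
        db = sqlite3.connect(path)
        try:
            results.append(db.execute(QUERY, (key,)).fetchone())
        finally:
            db.close()
    return results

def lookup_shared(path, keys):
    """Run every query on one long-lived connection."""
    db = sqlite3.connect(path)
    try:
        return [db.execute(QUERY, (key,)).fetchone() for key in keys]
    finally:
        db.close()
```
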
(Note that if the database is a cache, there is no need to perform migrations
when querying it. My benchmarks skip `runMigration`. Instead, if the query
fails, then either the database doesn't exist or it uses an incompatible
schema, and the cache can be rebuilt at that point. This avoids the problem
that persistent's migrations can sometimes fail.)

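The "rebuild instead of migrate" idea can be sketched as follows, again with Python's `sqlite3` module standing in for persistent. The `source` dictionary is a hypothetical placeholder for whatever the cache is regenerated from (the git repository, or a fresh fsck run), and the table layout is assumed. A failed query is taken to mean the table is missing or has an incompatible shape, and the cache is simply thrown away and rebuilt.

```python
import sqlite3

SCHEMA = "CREATE TABLE cached_key (key TEXT PRIMARY KEY, last_fscked INTEGER)"
QUERY = "SELECT last_fscked FROM cached_key WHERE key = ?"

def rebuild(db, source):
    """Discard whatever is there and regenerate the cache from `source`."""
    with db:
        db.execute("DROP TABLE IF EXISTS cached_key")
        db.execute(SCHEMA)
        db.executemany("INSERT INTO cached_key VALUES (?, ?)", source.items())

def last_fscked(db, key, source):
    """Look up a key, rebuilding the cache once if the query itself fails."""
    try:
        row = db.execute(QUERY, (key,)).fetchone()
    except sqlite3.OperationalError:
        # Missing table or column: treat it as a stale cache, never migrate.
        rebuild(db, source)
        row = db.execute(QUERY, (key,)).fetchone()
    return row[0] if row else None
```
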
Doubling the database to 60k records scaled disk use and CPU time linearly
and did not affect query time.

----

Here's a normalized schema:

<pre>
CachedKey
  key Key
  KeyIndex key
  deriving Show

AssociatedFiles
  keyId CachedKeyId Eq
  associatedFile FilePath
  deriving Show

CachedMetaField
  field MetaField
  FieldIndex field

CachedMetaData
  keyId CachedKeyId Eq
  fieldId CachedMetaFieldId Eq
  metaValue String

LastFscked
  keyId CachedKeyId Eq
  localFscked Int Maybe
</pre>

With this, running 1000 joins to get the associated files of 1000
Keys took 5.6s with a warm cache, even when all were done in the same
`runSqlite` call. Ouch!

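For reference, the kind of join being measured looks roughly like this when written against sqlite directly. This is a sketch with Python's `sqlite3` module, not the persistent code that was benchmarked. The SQL table and column names are my own rendering of the normalized schema, and the index on `key_id` is my addition; the schema above does not declare one.

```python
import sqlite3

def make_db():
    """Create an in-memory database with an assumed normalized layout."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE cached_key (
            id  INTEGER PRIMARY KEY,
            key TEXT NOT NULL UNIQUE
        );
        CREATE TABLE associated_files (
            key_id          INTEGER NOT NULL REFERENCES cached_key(id),
            associated_file TEXT NOT NULL
        );
        CREATE INDEX associated_files_key_id ON associated_files (key_id);
    """)
    return db

def add_file(db, key, file):
    """Record that `file` is associated with `key`, creating the key row if needed."""
    with db:
        db.execute("INSERT OR IGNORE INTO cached_key (key) VALUES (?)", (key,))
        db.execute(
            "INSERT INTO associated_files (key_id, associated_file)"
            " SELECT id, ? FROM cached_key WHERE key = ?",
            (file, key),
        )

def associated_files(db, key):
    """Join from the key table to its associated files, sorted by name."""
    rows = db.execute(
        "SELECT f.associated_file FROM cached_key AS k"
        " JOIN associated_files AS f ON f.key_id = k.id"
        " WHERE k.key = ? ORDER BY f.associated_file",
        (key,),
    )
    return [file for (file,) in rows]
```
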
Compare the above with 1000 calls to `associatedFiles`, which is approximately
as fast as just opening and reading 1000 files, so will take well under
0.05s with a **cold** cache.

So, we're looking at nearly an order of magnitude slowdown using sqlite and
persistent for associated files, even with the non-normalized schema
(under 0.35s versus well under 0.05s per 1000 lookups); the normalized
joins, at 5.6s, are slower still. OTOH, the normalized schema should
perform better when adding an associated file to a key that already has many.

For metadata, the story is much nicer. Querying for 30000 keys that all
have a particular tag in their metadata takes 0.65s. That is fast enough to
be used in views.

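A query of that shape, "all keys that have a particular tag", might be sketched like this. As before this uses Python's `sqlite3` module in place of persistent, and the SQL table names, column names, and helper functions are assumptions based on the normalized schema, not the code that was benchmarked.

```python
import sqlite3

def make_meta_db():
    """Create an in-memory database with an assumed metadata layout."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE cached_key (
            id INTEGER PRIMARY KEY, key TEXT NOT NULL UNIQUE);
        CREATE TABLE cached_meta_field (
            id INTEGER PRIMARY KEY, field TEXT NOT NULL UNIQUE);
        CREATE TABLE cached_meta_data (
            key_id     INTEGER NOT NULL REFERENCES cached_key(id),
            field_id   INTEGER NOT NULL REFERENCES cached_meta_field(id),
            meta_value TEXT NOT NULL);
    """)
    return db

def set_meta(db, key, field, value):
    """Attach one field/value pair to a key, creating lookup rows as needed."""
    with db:
        db.execute("INSERT OR IGNORE INTO cached_key (key) VALUES (?)", (key,))
        db.execute(
            "INSERT OR IGNORE INTO cached_meta_field (field) VALUES (?)", (field,)
        )
        db.execute(
            "INSERT INTO cached_meta_data (key_id, field_id, meta_value)"
            " SELECT k.id, f.id, ? FROM cached_key AS k, cached_meta_field AS f"
            " WHERE k.key = ? AND f.field = ?",
            (value, key, field),
        )

def keys_with(db, field, value):
    """Return every key whose metadata has the given field set to the given value."""
    rows = db.execute(
        "SELECT k.key FROM cached_meta_data AS d"
        " JOIN cached_key AS k ON k.id = d.key_id"
        " JOIN cached_meta_field AS f ON f.id = d.field_id"
        " WHERE f.field = ? AND d.meta_value = ? ORDER BY k.key",
        (field, value),
    )
    return [key for (key,) in rows]
```

A view for a tag would then be built from the result of something like `keys_with(db, "tag", "holiday")`, where the field and tag names are only examples.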