git-annex/doc/design/caching_database.mdwn

* [[metadata]] for views
* [direct mode mappings scale badly with thousands of identical files](/bugs/__34__Adding_4923_files__34___is_really_slow)
* [[bugs/incremental_fsck_should_not_use_sticky_bit]]
* [[todo/wishlist:_pack_metadata_in_direct_mode]]

What do all these have in common? They could all be improved by
using some kind of database to locally store the information in an
efficient way.

The database should only function as a cache. It should be able to be
generated and updated by looking at the git repository.

* Metadata can be updated by looking at the git-annex branch,
  either its current state, or the diff between the old and new versions.
* Direct mode mappings can be updated by looking at the current branch
  to see which files map to which key, or at the diff between the old
  and new versions of the branch.
* Incremental fsck information is not stored in git, but can be
  "regenerated" by running fsck again.  
  (Perhaps doesn't quite fit, but let it slide..)

Store in the database the Ref of the branch that was used to construct it.
(Update in same transaction as cached data.)
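
A minimal sketch of that idea, written against Python's stdlib `sqlite3` rather than persistent so that it runs as-is. The table names (`cache_state`, `cached_metadata`) and the function names are made up for illustration and are not part of any planned schema. The point is only that the diff and the new Ref are committed together, so the stored Ref never claims more than the cache actually contains.

```python
import sqlite3

def open_cache(path=":memory:"):
    """Open the cache, creating the (hypothetical) tables if they are missing."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS cache_state ("
        " id INTEGER PRIMARY KEY CHECK (id = 1),"  # single-row table
        " branch_ref TEXT NOT NULL)")
    db.execute(
        "CREATE TABLE IF NOT EXISTS cached_metadata ("
        " key TEXT NOT NULL, field TEXT NOT NULL, value TEXT NOT NULL,"
        " UNIQUE (key, field, value))")
    return db

def cached_ref(db):
    """Return the Ref the cache was last built from, or None if never built."""
    row = db.execute("SELECT branch_ref FROM cache_state WHERE id = 1").fetchone()
    return row[0] if row else None

def update_cache(db, new_ref, added, removed=()):
    """Apply a diff (rows added/removed) and record the Ref it leads to."""
    with db:  # commits on success, rolls back if anything below raises
        db.executemany(
            "DELETE FROM cached_metadata WHERE key = ? AND field = ? AND value = ?",
            removed)
        db.executemany(
            "INSERT OR IGNORE INTO cached_metadata VALUES (?, ?, ?)", added)
        db.execute(
            "INSERT OR REPLACE INTO cache_state (id, branch_ref) VALUES (1, ?)",
            (new_ref,))
```

If an update fails part way through, `cached_ref` still returns the old Ref, and the same diff can simply be applied again.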

## implementation plan

1. Implement for metadata, on a branch, with sqlite.
2. Make sure that builds on all platforms.
3. Add associated file mappings support. This is needed to fully
   use the caching database to construct views.
4. Store incremental fsck info in db.
5. Replace .map files with 3. for direct mode.

## case study: persistent with sqlite

Here's a non-normalized database schema in persistent's syntax.

<pre>
CachedKey
  key Key
  associatedFiles [FilePath]
  lastFscked Int Maybe
  KeyIndex key

CachedMetaData
  key Key
  metaDataField MetaDataField
  metaDataValue MetaDataValue
</pre>

Using the above database schema and persistent with sqlite, I made
a database containing 30k Cache records. This took 5 seconds to create
and was 7 MB on disk. (It would be rather smaller if a more packed Key
show/read instance were used.)

Running 1000 separate queries to get 1000 CachedKeys took 0.688s with warm
cache. This was more than halved when all 1000 queries were done inside the
same `runSqlite` call. (Which could be done using a separate thread and some
MVars.)

(Note that if the database is a cache, there is no need to perform migrations
when querying it. My benchmarks skip `runMigration`. Instead, if the query
fails, the database either doesn't exist or uses an incompatible schema, and
the cache can be rebuilt then. This avoids the problem that persistent's
migrations can sometimes fail.)
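
The "query first, rebuild on failure" approach could look roughly like this. Again this uses Python's `sqlite3` as a stand-in for persistent. The `source` dict is a hypothetical placeholder for whatever the cache is regenerated from (the git repository, in the real design), and the table layout is only illustrative.

```python
import sqlite3

# Illustrative schema only; not the schema proposed in this page.
SCHEMA = "CREATE TABLE cached_key (key TEXT PRIMARY KEY, last_fscked INTEGER)"

def rebuild(db, source):
    """Throw away whatever is there and regenerate the cache from `source`."""
    db.execute("DROP TABLE IF EXISTS cached_key")
    db.execute(SCHEMA)
    with db:  # commit all the inserts together
        db.executemany("INSERT INTO cached_key VALUES (?, ?)", source.items())

def last_fscked(db, key, source):
    """Query without migrating; rebuild only if the query itself fails."""
    query = "SELECT last_fscked FROM cached_key WHERE key = ?"
    try:
        row = db.execute(query, (key,)).fetchone()
    except sqlite3.OperationalError:
        # Raised for a missing table or column, i.e. no database yet
        # or a schema this code does not understand.
        rebuild(db, source)
        row = db.execute(query, (key,)).fetchone()
    return row[0] if row else None
```

The common case pays nothing for schema checks; only a missing or stale cache takes the slow path.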

Doubling the db to 60k scaled linearly in disk and cpu and did not affect
query time.

----

Here's a normalized schema:

<pre>
CachedKey
  key Key
  KeyIndex key
  deriving Show

AssociatedFiles
  keyId CachedKeyId Eq
  associatedFile FilePath
  KeyIdIndex keyId associatedFile
  deriving Show

CachedMetaField
  field MetaField
  FieldIndex field

CachedMetaData
  keyId CachedKeyId Eq
  fieldId CachedMetaFieldId Eq
  metaValue String

LastFscked
  keyId CachedKeyId Eq
  lastFscked Int Maybe
</pre>

With this, running 1000 joins to get the associated files of 1000
Keys took 5.6s with warm cache. (When done in the same `runSqlite` call.) Ouch!

Update: This performance was fixed by adding `KeyIdIndex keyId associatedFile`,
which adds a uniqueness constraint on the tuple of key and associatedFile.
With this, 1000 queries take 0.406s. Note that persistent is probably not
actually doing a join at the SQL level, so this could be sped up using
eg, esqueleto.

Update2: Using esqueleto to do a join got this down to 0.250s.

Code: <http://lpaste.net/101141> <http://lpaste.net/101142>
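
For readers without access to the pastes, here is a rough SQL-level sketch of the normalized associated-files lookup, in Python's `sqlite3` rather than persistent/esqueleto. It is not the benchmarked code. Table and column names are my own rendering of the schema above. The `UNIQUE (key_id, associated_file)` constraint plays the role of `KeyIdIndex keyId associatedFile`: sqlite backs a uniqueness constraint with an index, which the join can then use.

```python
import sqlite3

def make_files_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE cached_key (
            id  INTEGER PRIMARY KEY,
            key TEXT NOT NULL UNIQUE              -- like: KeyIndex key
        );
        CREATE TABLE associated_files (
            key_id          INTEGER NOT NULL REFERENCES cached_key(id),
            associated_file TEXT NOT NULL,
            UNIQUE (key_id, associated_file)      -- like: KeyIdIndex keyId associatedFile
        );
    """)
    return db

def add_associated_file(db, key, path):
    """Record that `path` points at `key`; duplicates are silently ignored."""
    with db:
        db.execute("INSERT OR IGNORE INTO cached_key (key) VALUES (?)", (key,))
        db.execute(
            "INSERT OR IGNORE INTO associated_files "
            "SELECT id, ? FROM cached_key WHERE key = ?", (path, key))

def associated_files(db, key):
    """A real SQL join, as opposed to one query per table."""
    rows = db.execute(
        "SELECT f.associated_file FROM cached_key k "
        "JOIN associated_files f ON f.key_id = k.id "
        "WHERE k.key = ? ORDER BY f.associated_file", (key,))
    return [r[0] for r in rows]
```

Adding one more file to a key that already has many is a single row insert here, which is where the normalized schema should pay off.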

Compare the above with 1000 calls to `associatedFiles`, which is approximately
as fast as just opening and reading 1000 files, so will take well under
0.05s with a **cold** cache.

So, we're looking at nearly an order of magnitude slowdown using sqlite and
persistent for associated files. OTOH, the normalized schema should
perform better when adding an associated file to a key that already has many.

For metadata, the story is much nicer. Querying for 30000 keys that all
have a particular tag in their metadata takes 0.65s. So fast enough to be
used in views.
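
The kind of metadata query meant here can be sketched as follows, once more in Python's `sqlite3` as a stand-in for persistent. The table names follow the normalized schema above. The `meta_lookup` index and both function names are assumptions of mine, not something that was benchmarked.

```python
import sqlite3

def make_metadata_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE cached_key (id INTEGER PRIMARY KEY, key TEXT NOT NULL UNIQUE);
        CREATE TABLE cached_meta_field (id INTEGER PRIMARY KEY, field TEXT NOT NULL UNIQUE);
        CREATE TABLE cached_meta_data (
            key_id     INTEGER NOT NULL REFERENCES cached_key(id),
            field_id   INTEGER NOT NULL REFERENCES cached_meta_field(id),
            meta_value TEXT NOT NULL
        );
        -- Assumed index, so a (field, value) lookup need not scan the table.
        CREATE INDEX meta_lookup ON cached_meta_data (field_id, meta_value);
    """)
    return db

def set_metadata(db, key, field, value):
    """Attach field=value to a key, creating the key and field rows if needed."""
    with db:
        db.execute("INSERT OR IGNORE INTO cached_key (key) VALUES (?)", (key,))
        db.execute("INSERT OR IGNORE INTO cached_meta_field (field) VALUES (?)", (field,))
        db.execute(
            "INSERT INTO cached_meta_data "
            "SELECT k.id, f.id, ? FROM cached_key k, cached_meta_field f "
            "WHERE k.key = ? AND f.field = ?", (value, key, field))

def keys_with(db, field, value):
    """All keys whose metadata has field=value, e.g. every key with tag=work."""
    rows = db.execute(
        "SELECT k.key FROM cached_meta_data d "
        "JOIN cached_meta_field f ON f.id = d.field_id "
        "JOIN cached_key k ON k.id = d.key_id "
        "WHERE f.field = ? AND d.meta_value = ? ORDER BY k.key", (field, value))
    return [r[0] for r in rows]
```

A view would run one such query per tag or field value it needs to populate.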