git-annex/doc/todo/sqlite_database_improvements.mdwn

The `sqlite` branch changes the databases, updating annex.version to 8.
The branch is ready to merge, but it might be deferred until Q1 2020
to avoid user upgrade fatigue. Discussion below is about the motivation for
these changes.

----

Collection of non-ideal things about git-annex's use of sqlite databases.
Would be good to improve these sometime, but it would need a migration
process.


* Database.Keys.SQL.isInodeKnown has some really ugly SQL LIKE queries.
  Probably an index would not speed them up. They're only needed when
  git-annex detects inodes are not stable, eg on fat or probably windows.
  A better database
  schema should be able to eliminate the need for those LIKE queries.
  Eg, store the size and allowable mtimes in a separate table that is
  queried when necessary.

  Fixed.

* Several selects were not able to use indexes, so would be slow.

  Fixed by adding indexes.

* Database.Types has some suboptimal encodings for Key and InodeCache.
  They are both slow due to being implemented using String
  (which may be fixable w/o changing the DB schema),
  and the VARCHARs they generate are longer than necessary
  since they look like eg `SKey "whatever"` and `I "whatever"`

  Fixed.

* SFilePath is stored efficiently, and has to be a String anyway,
  (until ByteStringFilePath is used)
  but since it's stored as a VARCHAR, which sqlite interprets using the
  current locale, there can be encoding problems. This is at least worked
  around with a hack that escapes FilePaths that contain unusual
  characters. It would be much better to use a BLOB.

  Also, when LANG=C is sometimes used, the hack can result in duplicates with
  different representations of the same filename, like this:

	INSERT INTO associated VALUES(4,'SHA256E-s30--7d51d2454391a40e952bea478e45d64cf0d606e1e8c0652bb815a22e0e23419a,'foo.ü');
	INSERT INTO associated VALUES(5,'SHA256E-s30--7d51d2454391a40e952bea478e45d64cf0d606e1e8c0652bb815a22e0e23419a','"foo.\56515\56508"');

  See <http://git-annex.branchable.com/bugs/assistant_crashes_in_TransferScanner/>
  for an example of how this can happen.

  And it seems likely that a query by filename would fail if the filename
  was in the database but with a different encoding.

  Fixed by converting to blob.

* SKey and IKey could fail to round-trip as well, when a Key contains something
  (eg, a filename extension) that is not valid in the current locale,
  for similar reasons to SFilePath. Using BLOB would be better.

  See [[!commit cf260d9a159050e2a7e70394fdd8db289c805ec3]] for details
  about the encoding problem for SFilePath. I reproduced a similar problem
  for IKey by making a file `foo.ü` and running `git add` on  it in a unicode
  locale. Then with LANG=C, `git annex drop --force foo.ü` thinks
  it drops the content, but in fact the work tree file is left containing
  the dropped content. The database then contained:

	INSERT INTO associated VALUES(8,'SHA256E-s30--59594eea8d6f64156b3ce6530cc3a661739abf2a0b72443de8683c34b0b19344.ü','foo.ü');
	INSERT INTO associated VALUES(9,'SHA256E-s30--59594eea8d6f64156b3ce6530cc3a661739abf2a0b72443de8683c34b0b19344.<2E><>','"foo.\56515\56508"');

  Fixed by converting to blob.

* migration

> Investigated this in more detail, and I can't find a way to
> solve the encoding problem other than changing the encoding
> SKey, IKey, and SFilePath in a non-backwards-compatible way.
>
> Probably the encoding problem is actually not in sqlite, but
> in persistent's use of Text internally. I did some tests with sqlite3
> command and it did not seem to vary query results based on the locale
> when using VARCHAR values. I was able to successfully insert an
> invalid unicode `ff` byte into it, and get the same byte back out.
>
> Unfortunately, it's not possible to make persistent not use Text
> for VARCHAR. While its PersistDbSpecific lets a non-Text value be stored
> as VARCHAR, any VARCHAR value coming out of the database gets converted
> to a PersistText.
>
> So that seems to leave using a BLOB to store a ByteString for
> SKey, IKey, and SFilePath. But old git-annex won't be able to
> read the updated databases, and won't know that it can't read them!
>
> This seems to call for a flag day, throwing out the old database
> contents and regenerating them from other data:
>
> * Fsck (SKey)
>   can't rebuild? Just drop and let incremental fscks re-do work
>
>   (done)

> * ContentIdentifier (IKey)
>   rebuild with updateFromLog, will need to diff from empty tree to
>   current git-annex branch, may be expensive to do!

    (done; will be done automatically by the first command that needs to
    use the db)
>
> * Export (IKey, SFilePath)
>   difficult to rebuild, what if in the middle of an interrupted
>   export?
>
>   updateExportTreeFromLog only updates two tables (ExportTree and
>   ExportTreeCurrent), not others (Exported and ExportedDirectory).
>
>   Conceptually, this is the same as the repo being lost and another
>   clone being used to update the export. The clone can only learn
>   export state from the log. It's supposed to recover from such
>   situations, the next time an export is run, so should be ok.
>
>   An interrupted export won't resume where it left off, since the
>   information about a partial export doesn't reach the git-annex branch.
>   So it will re-send some files on resume. Documenting this should
>   be good enough.
>
>   Testing this an exporting to a directory with both exporttree=yes and
>   importtree=yes, it refused to let an interrupted export proceed after
>   upgrade, with "unsafe to overwrite file". An import resolved the
>   problem. Guess the problem is that the content idenfifier did not
>   get recorded in the git-annex branch and the db value was lost on upgrade.
>
>   If the export is not interrupted before upgrade, later exports are able
>   to overwrite the exported files as they should.
>
>   Hmm, testing again, with a script, and interrupting the export, I am
>   not able to reproduce the problem either. The cids of exported files
>   get written to the journal, so make their way into the git-annex branch
>   and so the cid db gets repopulated. Perhaps my earlier manual test
>   was mistaken somehow.
>
> * Keys (IKey, SFilePath, SInodeCache)
>   Use scanUnlockedFiles to repopulate the Associated table.
>
>   But that does not repopulate the Content table. Doing so needs
>   to iterate over the unlocked files, filter out any that are modified,
>   and record the InodeCaches of the unmodified ones. Seems that it would
>   have to use git's index to know which files are modified.
>
>   There is a race; a file could be modified after getting the list of
>   modified files. To completely avoid that race is tricky. To mostly
>   eliminate it, just generate the InodeCache, then check
>   if the file is still unmodified, then check if the InodeCache is still
>   valid. That leaves some much less likely races where files are being
>   repeatedly swapped and the InodeCache generations see one file while
>   the git ls-files --modified see the other one.
>
>   To fully avoid the race, use git ls-files --cached --debug,
>   and parse the debug output into a InodeCache! This way the info
>   from git's index is simply copied over into the git-annex database.
>   One little problem: The --debug format is not specified and may change.
>   However, it has never actually changed since it was introduced in 2010
>   (git v1.8.3.1), except for a fix for an unsigned int overflow bug that
>   was fixed in April 2019.
>
>   (done)
>
> Alternatively, can keep the old database code and use it to read the old
> databases during the migration. But then bad data that got in due to the
> encoding problems will persist.

[[done]] --[[Joey]]