git-annex/doc/todo/sqlite_database_improvements.mdwn

Collection of non-ideal things about git-annex's use of sqlite databases.
Would be good to improve these sometime, but it would need a migration
process.
* Database.Keys.SQL.isInodeKnown has some really ugly SQL LIKE queries.
Probably an index would not speed them up. They're only needed when
git-annex detects inodes are not stable, eg on FAT or probably Windows.
A better database
schema should be able to eliminate the need for those LIKE queries.
Eg, store the size and allowable mtimes in a separate table that is
queried when necessary.
Fixed.
* Several selects were not able to use indexes, so would be slow.
Fixed by adding indexes.
* Database.Types has some suboptimal encodings for Key and InodeCache.
They are both slow due to being implemented using String
(which may be fixable w/o changing the DB schema),
and the VARCHARs they generate are longer than necessary
since they look like eg `SKey "whatever"` and `I "whatever"`
Fixed.
* SFilePath is stored efficiently, and has to be a String anyway,
(until ByteStringFilePath is used)
but since it's stored as a VARCHAR, which sqlite interprets using the
current locale, there can be encoding problems. This is at least worked
around with a hack that escapes FilePaths that contain unusual
characters. It would be much better to use a BLOB.
Also, when LANG=C is sometimes used, the hack can result in duplicates with
different representations of the same filename, like this:
INSERT INTO associated VALUES(4,'SHA256E-s30--7d51d2454391a40e952bea478e45d64cf0d606e1e8c0652bb815a22e0e23419a','foo.ü');
INSERT INTO associated VALUES(5,'SHA256E-s30--7d51d2454391a40e952bea478e45d64cf0d606e1e8c0652bb815a22e0e23419a','"foo.\56515\56508"');
See <http://git-annex.branchable.com/bugs/assistant_crashes_in_TransferScanner/>
for an example of how this can happen.
And it seems likely that a query by filename would fail if the filename
was in the database but with a different encoding.
Fixed by converting to blob.
* IKey could fail to round-trip as well, when a Key contains something
(eg, a filename extension) that is not valid in the current locale,
for similar reasons to SFilePath. Using BLOB would be better.
See [[!commit cf260d9a159050e2a7e70394fdd8db289c805ec3]] for details
about the encoding problem for SFilePath. I reproduced a similar problem
for IKey by making a file `foo.ü` and running `git add` on it in a unicode
locale. Then with LANG=C, `git annex drop --force foo.ü` thinks
it drops the content, but in fact the work tree file is left containing
the dropped content. The database then contained:
INSERT INTO associated VALUES(8,'SHA256E-s30--59594eea8d6f64156b3ce6530cc3a661739abf2a0b72443de8683c34b0b19344.ü','foo.ü');
INSERT INTO associated VALUES(9,'SHA256E-s30--59594eea8d6f64156b3ce6530cc3a661739abf2a0b72443de8683c34b0b19344.<2E><>','"foo.\56515\56508"');
Fixed by converting to blob.
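
The duplicate rows shown in the items above can be reproduced outside git-annex. The following Python sketch is only an illustration, not git-annex's code: `as_text` is a hypothetical stand-in for the locale-dependent VARCHAR encoding plus the escaping hack, and the table names and the shortened key are made up. It assumes that bytes which cannot be decoded in the current locale are mapped to surrogate code points (byte 0xC3 becomes 56515, byte 0xBC becomes 56508), which is consistent with the numbers in the escaped rows above.

```python
import sqlite3

# Bytes of the filename as stored on disk (UTF-8 encoding of "foo.ü").
raw = "foo.ü".encode("utf-8")

def as_text(name: bytes, encoding: str) -> str:
    """Hypothetical stand-in for the locale-dependent VARCHAR encoding."""
    s = name.decode(encoding, errors="surrogateescape")
    if not any(0xDC80 <= ord(c) <= 0xDCFF for c in s):
        return s  # decoded cleanly, stored as-is
    # Undecodable bytes present: escape non-ASCII code points numerically.
    body = "".join(c if ord(c) < 128 else "\\%d" % ord(c) for c in s)
    return '"' + body + '"'

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE associated_text (key VARCHAR, file VARCHAR, UNIQUE (key, file))")
db.execute("CREATE TABLE associated_blob (key BLOB, file BLOB, UNIQUE (key, file))")

key = "SHA256E-s30--example"  # made-up key, shortened for readability
for encoding in ("utf-8", "ascii"):  # a unicode locale, then LANG=C
    db.execute("INSERT OR IGNORE INTO associated_text VALUES (?, ?)",
               (key, as_text(raw, encoding)))
    db.execute("INSERT OR IGNORE INTO associated_blob VALUES (?, ?)",
               (key.encode("ascii"), raw))

text_files = [r[0] for r in db.execute("SELECT file FROM associated_text ORDER BY rowid")]
blob_files = [r[0] for r in db.execute("SELECT file FROM associated_blob")]
print(text_files)  # two different representations of one filename
print(blob_files)  # a single row holding the raw bytes
```

With the text column, the same file ends up in two rows because its representation depends on the locale in effect at insert time. With the blob column, the bytes are the representation, so the second insert is recognised as a duplicate.
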
----
remaining todo:
* migration
> Investigated this in more detail, and I can't find a way to
> solve the encoding problem other than changing the encoding
> of SKey, IKey, and SFilePath in a non-backwards-compatible way.
>
> Probably the encoding problem is actually not in sqlite, but
> in persistent's use of Text internally. I did some tests with sqlite3
> command and it did not seem to vary query results based on the locale
> when using VARCHAR values. I was able to successfully insert an
> invalid unicode `ff` byte into it, and get the same byte back out.
>
> Unfortunately, it's not possible to make persistent not use Text
> for VARCHAR. While its PersistDbSpecific lets a non-Text value be stored
> as VARCHAR, any VARCHAR value coming out of the database gets converted
> to a PersistText.
>
> So that seems to leave using a BLOB to store a ByteString for
> SKey, IKey, and SFilePath. Attached patch shows how to do that,
> but old git-annex won't be able to read the updated databases,
> and won't know that it can't read them!
>
> This seems to call for a flag day, throwing out the old database
> contents and regenerating them from other data:
>
> * Fsck (SKey)
> can't rebuild? Just drop and let incremental fscks re-do work
> * ContentIdentifier (IKey)
> rebuild with updateFromLog, would need to diff from empty tree to
> current git-annex branch, may be expensive to do!
> * Export (IKey, SFilePath)
> difficult to rebuild, what if in the middle of an interrupted
> export?
>
> updateExportTreeFromLog only updates two tables, not others
>
> Conceptually, this is the same as the repo being lost and another
> clone being used to update the export. The clone can only learn
> export state from the log. It's supposed to recover from such
> situations, the next time an export is run, so should be ok.
> But it might result in already exported files being re-uploaded,
> or other unnecessary work.
> * Keys (IKey, SFilePath)
> rebuild with scanUnlockedFiles
>
> does that update the Content table with the InodeCache?
>
> But after such a transition, how to communicate to the old git-annex
> that it can't use the databases any longer? Moving the databases
> out of the way won't do; old git-annex will just recreate them and
> start with missing data!
>
> And, what about users who use a mix of old and new git-annex versions?
>
> Seems this needs an annex.version bump from v7 to v8.
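
The sqlite test described in the note above (inserting an invalid `ff` byte into a VARCHAR and reading the same byte back) can be approximated with Python's standard `sqlite3` module. This is a sketch under assumptions: the original test used the `sqlite3` command rather than Python, the table name `t` is made up, and the final decoding step is only an analogy for what a Unicode text type does to such a value, not persistent's actual code path.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.text_factory = bytes  # return TEXT values as raw bytes, without decoding
db.execute("CREATE TABLE t (v VARCHAR)")

# Store a single byte that is not valid UTF-8 as a TEXT value.
db.execute("INSERT INTO t VALUES (CAST(? AS TEXT))", (b"\xff",))

stored_type, stored_value = db.execute("SELECT typeof(v), v FROM t").fetchone()

# Querying by the same bytes finds the row.
matches = db.execute(
    "SELECT count(*) FROM t WHERE v = CAST(? AS TEXT)", (b"\xff",)
).fetchone()[0]

# Decoding to Unicode text is the step where the original byte is lost.
lossy = stored_value.decode("utf-8", errors="replace")

print(stored_type, stored_value, matches, repr(lossy))
```

sqlite stores and compares the byte unchanged. The information is only lost once the value is forced through a Unicode text type, which fits the conclusion above that a BLOB holding a ByteString avoids the problem.
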