The test suite found a bug; select_ can fail now because a uniqueness constraint has been added. Now the test suite passes. Also, I'm satisfied the changed PersistField instances work. Looking over what changed, and what I've already tested, Key, FilePath, and InodeCache are known working; ContentIdentifier is trivial ByteString to blob; and SSha is trivial String to varchar. Both are tested by the test suite. I've also tested the new FileSize and EpochTime instances already, and they work.
Collection of non-ideal things about git-annex's use of sqlite databases.
Would be good to improve these sometime, but it would need a migration
process.

* Database.Keys.SQL.isInodeKnown has some really ugly SQL LIKE queries.
  Probably an index would not speed them up. They're only needed when
  git-annex detects inodes are not stable, eg on fat or probably windows.
  A better database
  schema should be able to eliminate the need for those LIKE queries.
  Eg, store the size and allowable mtimes in a separate table that is
  queried when necessary.

  Fixed.

* Several selects were not able to use indexes, so would be slow.

  Fixed by adding indexes.

* Database.Types has some suboptimal encodings for Key and InodeCache.
  They are both slow due to being implemented using String
  (which may be fixable w/o changing the DB schema),
  and the VARCHARs they generate are longer than necessary
  since they look like eg `SKey "whatever"` and `I "whatever"`

  Fixed.

* SFilePath is stored efficiently, and has to be a String anyway,
  (until ByteStringFilePath is used)
  but since it's stored as a VARCHAR, which sqlite interprets using the
  current locale, there can be encoding problems. This is at least worked
  around with a hack that escapes FilePaths that contain unusual
  characters. It would be much better to use a BLOB.

  Also, when LANG=C is sometimes used, the hack can result in duplicates with
  different representations of the same filename, like this:

      INSERT INTO associated VALUES(4,'SHA256E-s30--7d51d2454391a40e952bea478e45d64cf0d606e1e8c0652bb815a22e0e23419a','foo.ü');
      INSERT INTO associated VALUES(5,'SHA256E-s30--7d51d2454391a40e952bea478e45d64cf0d606e1e8c0652bb815a22e0e23419a','"foo.\56515\56508"');

  See <http://git-annex.branchable.com/bugs/assistant_crashes_in_TransferScanner/>
  for an example of how this can happen.

  And it seems likely that a query by filename would fail if the filename
  was in the database but with a different encoding.

  Fixed by converting to blob.

* IKey could fail to round-trip as well, when a Key contains something
  (eg, a filename extension) that is not valid in the current locale,
  for similar reasons to SFilePath. Using BLOB would be better.

  See [[!commit cf260d9a159050e2a7e70394fdd8db289c805ec3]] for details
  about the encoding problem for SFilePath. I reproduced a similar problem
  for IKey by making a file `foo.ü` and running `git add` on it in a unicode
  locale. Then with LANG=C, `git annex drop --force foo.ü` thinks
  it drops the content, but in fact the work tree file is left containing
  the dropped content. The database then contained:

      INSERT INTO associated VALUES(8,'SHA256E-s30--59594eea8d6f64156b3ce6530cc3a661739abf2a0b72443de8683c34b0b19344.ü','foo.ü');
      INSERT INTO associated VALUES(9,'SHA256E-s30--59594eea8d6f64156b3ce6530cc3a661739abf2a0b72443de8683c34b0b19344.<2E><>','"foo.\56515\56508"');

  Fixed by converting to blob.

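
The duplicate rows shown above, and why storing bytes avoids them, can be
sketched as follows. This is an illustration in Python, not git-annex's
Haskell code: Python's `surrogateescape` error handler stands in for
locale-dependent filename decoding, the `associated` table is a simplified,
hypothetical schema, and `text_view`, `open_db`, `add_associated` and
`associated_files` are made-up names. Under that assumption, the code points
56515 and 56508 seen in the rows above are what the bytes `c3 bc` (a UTF-8 `ü`)
turn into when decoded in an ASCII locale, so one filename gets two text forms,
while its byte form stays the same.

```python
import sqlite3

KEY = b"SHA256E-s30--7d51d2454391a40e952bea478e45d64cf0d606e1e8c0652bb815a22e0e23419a"
RAW_NAME = "foo.ü".encode("utf-8")  # the filename bytes: b"foo.\xc3\xbc"


def text_view(raw: bytes, encoding: str) -> str:
    """Decode filename bytes as a locale-dependent program might, mapping
    undecodable bytes to lone surrogates (U+DC80..U+DCFF)."""
    return raw.decode(encoding, "surrogateescape")


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # Hypothetical, simplified table; not git-annex's actual schema.
    conn.execute(
        "CREATE TABLE associated ("
        " id INTEGER PRIMARY KEY, key BLOB NOT NULL, file BLOB NOT NULL,"
        " UNIQUE (key, file))"
    )
    return conn


def add_associated(conn: sqlite3.Connection, key: bytes, file: bytes) -> None:
    # Python bytes are bound as BLOBs, so no text encoding is involved.
    conn.execute(
        "INSERT OR IGNORE INTO associated (key, file) VALUES (?, ?)", (key, file)
    )


def associated_files(conn: sqlite3.Connection, key: bytes) -> list:
    rows = conn.execute(
        "SELECT file FROM associated WHERE key = ? ORDER BY id", (key,)
    )
    return [row[0] for row in rows]


# The same bytes give two different strings depending on the encoding used.
utf8_name = text_view(RAW_NAME, "utf-8")
c_name = text_view(RAW_NAME, "ascii")
print(utf8_name == c_name)
print([ord(ch) for ch in c_name[4:]])  # code points of the two escaped bytes

# Converted back to bytes, both views are identical, so the uniqueness
# constraint leaves a single row.
conn = open_db()
add_associated(conn, KEY, utf8_name.encode("utf-8"))
add_associated(conn, KEY, c_name.encode("ascii", "surrogateescape"))
print(associated_files(conn, KEY))
```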
----

remaining todo:

* migration

> Investigated this in more detail, and I can't find a way to
> solve the encoding problem other than changing the encoding of
> SKey, IKey, and SFilePath in a non-backwards-compatible way.
>
> Probably the encoding problem is actually not in sqlite, but
> in persistent's use of Text internally. I did some tests with sqlite3
> command and it did not seem to vary query results based on the locale
> when using VARCHAR values. I was able to successfully insert an
> invalid unicode `ff` byte into it, and get the same byte back out.
>
> Unfortunately, it's not possible to make persistent not use Text
> for VARCHAR. While its PersistDbSpecific lets a non-Text value be stored
> as VARCHAR, any VARCHAR value coming out of the database gets converted
> to a PersistText.
>
> So that seems to leave using a BLOB to store a ByteString for
> SKey, IKey, and SFilePath. Attached patch shows how to do that,
> but old git-annex won't be able to read the updated databases,
> and won't know that it can't read them!
>
> This seems to call for a flag day, throwing out the old database
> contents and regenerating them from other data:
>
> * Fsck (SKey)
>   can't rebuild? Just drop and let incremental fscks re-do work
> * ContentIdentifier (IKey)
>   rebuild with updateFromLog, would need to diff from empty tree to
>   current git-annex branch, may be expensive to do!
> * Export (IKey, SFilePath)
>   difficult to rebuild, what if in the middle of an interrupted
>   export?
>
>   updateExportTreeFromLog only updates two tables, not others
>
>   Conceptually, this is the same as the repo being lost and another
>   clone being used to update the export. The clone can only learn
>   export state from the log. It's supposed to recover from such
>   situations, the next time an export is run, so should be ok.
>   But it might result in already exported files being re-uploaded,
>   or other unnecessary work.
> * Keys (IKey, SFilePath)
>   rebuild with scanUnlockedFiles
>
>   does that update the Content table with the InodeCache?
>
> But after such a transition, how to communicate to the old git-annex
> that it can't use the databases any longer? Moving the databases
> out of the way won't do; old git-annex will just recreate them and
> start with missing data!
>
> And, what about users who use a mix of old and new git-annex versions?
>
> Seems this needs an annex.version bump from v7 to v8.
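
The sqlite-level test mentioned in the notes above (an invalid `ff` byte going
into a VARCHAR column and coming back unchanged) can be approximated with the
sketch below. It uses Python's `sqlite3` module rather than the `sqlite3`
command, and the table `t` and the function `roundtrip_text_bytes` are made up
for illustration. It only shows sqlite's own behaviour; it says nothing about
how persistent handles the value.

```python
import sqlite3


def roundtrip_text_bytes(raw: bytes) -> tuple:
    """Store raw bytes as a TEXT value in a VARCHAR column, then read them
    back undecoded, together with the storage class sqlite reports."""
    conn = sqlite3.connect(":memory:")
    conn.text_factory = bytes  # hand TEXT values back as raw bytes
    conn.execute("CREATE TABLE t (v VARCHAR)")
    # CAST makes sqlite treat the bound bytes as text, without validating
    # that they are well-formed UTF-8.
    conn.execute("INSERT INTO t (v) VALUES (CAST(? AS TEXT))", (raw,))
    value, storage = conn.execute("SELECT v, typeof(v) FROM t").fetchone()
    conn.close()
    return value, storage.decode("ascii")


value, storage = roundtrip_text_bytes(b"foo.\xff")
print(value, storage)
```

If the value comes back byte for byte with storage class `text`, that matches
the observation that the locale problem lies above sqlite, in the conversion
of VARCHAR values to Text.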