2019-11-07 17:26:05 +00:00
|
|
|
|
The `sqlite` branch changes the databases, updating annex.version to 8.
|
|
|
|
|
The branch is ready to merge, but it might be deferred until Q1 2020
|
|
|
|
|
to avoid user upgrade fatigue. Discussion below is about the motivation for
|
|
|
|
|
these changes.
|
|
|
|
|
|
|
|
|
|
----
|
|
|
|
|
|
2019-03-06 15:17:06 +00:00
|
|
|
|
Collection of non-ideal things about git-annex's use of sqlite databases.
|
|
|
|
|
Would be good to improve these sometime, but it would need a migration
|
|
|
|
|
process.
|
|
|
|
|
|
2019-10-23 18:06:11 +00:00
|
|
|
|
|
2019-11-05 16:50:53 +00:00
|
|
|
|
* Database.Keys.SQL.isInodeKnown has some really ugly SQL LIKE queries.
|
|
|
|
|
Probably an index would not speed them up. They're only needed when
|
|
|
|
|
git-annex detects inodes are not stable, eg on fat or probably windows.
|
|
|
|
|
A better database
|
2019-10-23 18:06:11 +00:00
|
|
|
|
schema should be able to eliminate the need for those LIKE queries.
|
|
|
|
|
Eg, store the size and allowable mtimes in a separate table that is
|
|
|
|
|
queried when necessary.
|
|
|
|
|
|
2019-11-05 16:50:53 +00:00
|
|
|
|
Fixed.
|
2019-03-06 15:17:06 +00:00
|
|
|
|
|
2019-11-05 16:50:53 +00:00
|
|
|
|
* Several selects were not able to use indexes, so would be slow.
|
|
|
|
|
|
|
|
|
|
Fixed by adding indexes.
|
2019-03-06 15:17:06 +00:00
|
|
|
|
|
|
|
|
|
* Database.Types has some suboptimal encodings for Key and InodeCache.
|
|
|
|
|
They are both slow due to being implemented using String
|
|
|
|
|
(which may be fixable w/o changing the DB schema),
|
|
|
|
|
and the VARCHARs they generate are longer than necessary
|
|
|
|
|
since they look like eg `SKey "whatever"` and `I "whatever"`
|
|
|
|
|
|
2019-11-05 16:50:53 +00:00
|
|
|
|
Fixed.
|
|
|
|
|
|
2019-03-06 15:17:06 +00:00
|
|
|
|
* SFilePath is stored efficiently, and has to be a String anyway,
|
|
|
|
|
(until ByteStringFilePath is used)
|
|
|
|
|
but since it's stored as a VARCHAR, which sqlite interprets using the
|
|
|
|
|
current locale, there can be encoding problems. This is at least worked
|
|
|
|
|
around with a hack that escapes FilePaths that contain unusual
|
|
|
|
|
characters. It would be much better to use a BLOB.
|
|
|
|
|
|
2019-06-04 18:13:15 +00:00
|
|
|
|
Also, when LANG=C is sometimes used, the hack can result in duplicates with
|
|
|
|
|
different representations of the same filename, like this:
|
|
|
|
|
|
|
|
|
|
INSERT INTO associated VALUES(4,'SHA256E-s30--7d51d2454391a40e952bea478e45d64cf0d606e1e8c0652bb815a22e0e23419a,'foo.ü');
|
|
|
|
|
INSERT INTO associated VALUES(5,'SHA256E-s30--7d51d2454391a40e952bea478e45d64cf0d606e1e8c0652bb815a22e0e23419a','"foo.\56515\56508"');
|
|
|
|
|
|
2019-08-26 16:29:43 +00:00
|
|
|
|
See <http://git-annex.branchable.com/bugs/assistant_crashes_in_TransferScanner/>
|
|
|
|
|
for an example of how this can happen.
|
|
|
|
|
|
2019-06-04 18:38:55 +00:00
|
|
|
|
And it seems likely that a query by filename would fail if the filename
|
|
|
|
|
was in the database but with a different encoding.
|
|
|
|
|
|
2019-11-05 16:50:53 +00:00
|
|
|
|
Fixed by converting to blob.
|
|
|
|
|
|
2019-11-06 18:23:00 +00:00
|
|
|
|
* SKey and IKey could fail to round-trip as well, when a Key contains something
|
2019-03-06 15:17:06 +00:00
|
|
|
|
(eg, a filename extension) that is not valid in the current locale,
|
|
|
|
|
for similar reasons to SFilePath. Using BLOB would be better.
|
2019-06-04 18:13:15 +00:00
|
|
|
|
|
2019-06-04 18:38:55 +00:00
|
|
|
|
See [[!commit cf260d9a159050e2a7e70394fdd8db289c805ec3]] for details
|
|
|
|
|
about the encoding problem for SFilePath. I reproduced a similar problem
|
|
|
|
|
for IKey by making a file `foo.ü` and running `git add` on it in a unicode
|
2019-06-04 18:13:15 +00:00
|
|
|
|
locale. Then with LANG=C, `git annex drop --force foo.ü` thinks
|
|
|
|
|
it drops the content, but in fact the work tree file is left containing
|
|
|
|
|
the dropped content. The database then contained:
|
|
|
|
|
|
|
|
|
|
INSERT INTO associated VALUES(8,'SHA256E-s30--59594eea8d6f64156b3ce6530cc3a661739abf2a0b72443de8683c34b0b19344.ü','foo.ü');
|
|
|
|
|
INSERT INTO associated VALUES(9,'SHA256E-s30--59594eea8d6f64156b3ce6530cc3a661739abf2a0b72443de8683c34b0b19344.<2E><>','"foo.\56515\56508"');
|
|
|
|
|
|
2019-11-05 16:50:53 +00:00
|
|
|
|
Fixed by converting to blob.
|
|
|
|
|
|
|
|
|
|
* migration
|
|
|
|
|
|
2019-06-04 18:13:15 +00:00
|
|
|
|
> Investigated this in more detail, and I can't find a way to
|
|
|
|
|
> solve the encoding problem other than changing the encoding
|
|
|
|
|
> SKey, IKey, and SFilePath in a non-backwards-compatible way.
|
|
|
|
|
>
|
2019-09-12 17:06:39 +00:00
|
|
|
|
> Probably the encoding problem is actually not in sqlite, but
|
|
|
|
|
> in persistent's use of Text internally. I did some tests with sqlite3
|
|
|
|
|
> command and it did not seem to vary query results based on the locale
|
|
|
|
|
> when using VARCHAR values. I was able to successfully insert an
|
|
|
|
|
> invalid unicode `ff` byte into it, and get the same byte back out.
|
2019-06-04 18:13:15 +00:00
|
|
|
|
>
|
2019-09-12 17:06:39 +00:00
|
|
|
|
> Unfortunately, it's not possible to make persistent not use Text
|
|
|
|
|
> for VARCHAR. While its PersistDbSpecific lets a non-Text value be stored
|
|
|
|
|
> as VARCHAR, any VARCHAR value coming out of the database gets converted
|
|
|
|
|
> to a PersistText.
|
|
|
|
|
>
|
|
|
|
|
> So that seems to leave using a BLOB to store a ByteString for
|
2019-11-06 18:23:00 +00:00
|
|
|
|
> SKey, IKey, and SFilePath. But old git-annex won't be able to
|
|
|
|
|
> read the updated databases, and won't know that it can't read them!
|
2019-06-04 18:13:15 +00:00
|
|
|
|
>
|
|
|
|
|
> This seems to call for a flag day, throwing out the old database
|
|
|
|
|
> contents and regenerating them from other data:
|
|
|
|
|
>
|
|
|
|
|
> * Fsck (SKey)
|
|
|
|
|
> can't rebuild? Just drop and let incremental fscks re-do work
|
2019-11-06 20:27:25 +00:00
|
|
|
|
>
|
|
|
|
|
> (done)
|
|
|
|
|
|
2019-06-04 18:13:15 +00:00
|
|
|
|
> * ContentIdentifier (IKey)
|
2019-11-06 20:43:52 +00:00
|
|
|
|
> rebuild with updateFromLog, will need to diff from empty tree to
|
2019-06-04 18:13:15 +00:00
|
|
|
|
> current git-annex branch, may be expensive to do!
|
2019-11-06 20:43:52 +00:00
|
|
|
|
|
|
|
|
|
(done; will be done automatically by the first command that needs to
|
|
|
|
|
use the db)
|
2019-11-06 20:27:25 +00:00
|
|
|
|
>
|
2019-06-04 18:13:15 +00:00
|
|
|
|
> * Export (IKey, SFilePath)
|
|
|
|
|
> difficult to rebuild, what if in the middle of an interrupted
|
|
|
|
|
> export?
|
|
|
|
|
>
|
2019-11-06 18:23:00 +00:00
|
|
|
|
> updateExportTreeFromLog only updates two tables (ExportTree and
|
|
|
|
|
> ExportTreeCurrent), not others (Exported and ExportedDirectory).
|
2019-06-04 18:13:15 +00:00
|
|
|
|
>
|
|
|
|
|
> Conceptually, this is the same as the repo being lost and another
|
|
|
|
|
> clone being used to update the export. The clone can only learn
|
|
|
|
|
> export state from the log. It's supposed to recover from such
|
|
|
|
|
> situations, the next time an export is run, so should be ok.
|
2019-11-07 17:10:39 +00:00
|
|
|
|
>
|
|
|
|
|
> An interrupted export won't resume where it left off, since the
|
|
|
|
|
> information about a partial export doesn't reach the git-annex branch.
|
|
|
|
|
> So it will re-send some files on resume. Documenting this should
|
|
|
|
|
> be good enough.
|
2019-11-06 21:13:39 +00:00
|
|
|
|
>
|
|
|
|
|
> Testing this an exporting to a directory with both exporttree=yes and
|
|
|
|
|
> importtree=yes, it refused to let an interrupted export proceed after
|
|
|
|
|
> upgrade, with "unsafe to overwrite file". An import resolved the
|
|
|
|
|
> problem. Guess the problem is that the content idenfifier did not
|
|
|
|
|
> get recorded in the git-annex branch and the db value was lost on upgrade.
|
2019-11-06 20:27:25 +00:00
|
|
|
|
>
|
2019-11-07 17:10:39 +00:00
|
|
|
|
> If the export is not interrupted before upgrade, later exports are able
|
|
|
|
|
> to overwrite the exported files as they should.
|
|
|
|
|
>
|
|
|
|
|
> Hmm, testing again, with a script, and interrupting the export, I am
|
|
|
|
|
> not able to reproduce the problem either. The cids of exported files
|
|
|
|
|
> get written to the journal, so make their way into the git-annex branch
|
|
|
|
|
> and so the cid db gets repopulated. Perhaps my earlier manual test
|
|
|
|
|
> was mistaken somehow.
|
|
|
|
|
>
|
2019-11-06 20:27:25 +00:00
|
|
|
|
> * Keys (IKey, SFilePath, SInodeCache)
|
2019-11-05 16:50:53 +00:00
|
|
|
|
> Use scanUnlockedFiles to repopulate the Associated table.
|
2019-06-04 18:13:15 +00:00
|
|
|
|
>
|
2019-11-05 16:50:53 +00:00
|
|
|
|
> But that does not repopulate the Content table. Doing so needs
|
2019-11-06 18:23:00 +00:00
|
|
|
|
> to iterate over the unlocked files, filter out any that are modified,
|
|
|
|
|
> and record the InodeCaches of the unmodified ones. Seems that it would
|
|
|
|
|
> have to use git's index to know which files are modified.
|
|
|
|
|
>
|
|
|
|
|
> There is a race; a file could be modified after getting the list of
|
|
|
|
|
> modified files. To completely avoid that race is tricky. To mostly
|
|
|
|
|
> eliminate it, just generate the InodeCache, then check
|
|
|
|
|
> if the file is still unmodified, then check if the InodeCache is still
|
|
|
|
|
> valid. That leaves some much less likely races where files are being
|
|
|
|
|
> repeatedly swapped and the InodeCache generations see one file while
|
|
|
|
|
> the git ls-files --modified see the other one.
|
|
|
|
|
>
|
|
|
|
|
> To fully avoid the race, use git ls-files --cached --debug,
|
|
|
|
|
> and parse the debug output into a InodeCache! This way the info
|
|
|
|
|
> from git's index is simply copied over into the git-annex database.
|
|
|
|
|
> One little problem: The --debug format is not specified and may change.
|
|
|
|
|
> However, it has never actually changed since it was introduced in 2010
|
|
|
|
|
> (git v1.8.3.1), except for a fix for an unsigned int overflow bug that
|
|
|
|
|
> was fixed in April 2019.
|
2019-11-06 20:27:25 +00:00
|
|
|
|
>
|
|
|
|
|
> (done)
|
2019-11-06 18:23:00 +00:00
|
|
|
|
>
|
|
|
|
|
> Alternatively, can keep the old database code and use it to read the old
|
|
|
|
|
> databases during the migration. But then bad data that got in due to the
|
|
|
|
|
> encoding problems will persist.
|
2019-11-07 17:23:32 +00:00
|
|
|
|
|
|
|
|
|
[[done]] --[[Joey]]
|