170 lines
		
	
	
	
		
			7.6 KiB
			
		
	
	
	
		
			Markdown
		
	
	
	
	
	
			
		
		
	
	
			170 lines
		
	
	
	
		
			7.6 KiB
			
		
	
	
	
		
			Markdown
		
	
	
	
	
	
The `sqlite` branch changes the databases, updating annex.version to 8.
 | 
						||
The branch is ready to merge, but it might be deferred until Q1 2020
 | 
						||
to avoid user upgrade fatigue. Discussion below is about the motivation for
 | 
						||
these changes.
 | 
						||
 | 
						||
----
 | 
						||
 | 
						||
Collection of non-ideal things about git-annex's use of sqlite databases.
 | 
						||
Would be good to improve these sometime, but it would need a migration
 | 
						||
process.
 | 
						||
 | 
						||
 | 
						||
* Database.Keys.SQL.isInodeKnown has some really ugly SQL LIKE queries. 
 | 
						||
  Probably an index would not speed them up. They're only needed when
 | 
						||
  git-annex detects inodes are not stable, eg on fat or probably windows.
 | 
						||
  A better database
 | 
						||
  schema should be able to eliminate the need for those LIKE queries.
 | 
						||
  Eg, store the size and allowable mtimes in a separate table that is
 | 
						||
  queried when necessary.
 | 
						||
 | 
						||
  Fixed.
 | 
						||
 | 
						||
* Several selects were not able to use indexes, so would be slow.
 | 
						||
 | 
						||
  Fixed by adding indexes.
 | 
						||
 | 
						||
* Database.Types has some suboptimal encodings for Key and InodeCache.
 | 
						||
  They are both slow due to being implemented using String
 | 
						||
  (which may be fixable w/o changing the DB schema),
 | 
						||
  and the VARCHARs they generate are longer than necessary
 | 
						||
  since they look like eg `SKey "whatever"` and `I "whatever"`
 | 
						||
 | 
						||
  Fixed.
 | 
						||
 | 
						||
* SFilePath is stored efficiently, and has to be a String anyway,
 | 
						||
  (until ByteStringFilePath is used)
 | 
						||
  but since it's stored as a VARCHAR, which sqlite interprets using the
 | 
						||
  current locale, there can be encoding problems. This is at least worked
 | 
						||
  around with a hack that escapes FilePaths that contain unusual
 | 
						||
  characters. It would be much better to use a BLOB.
 | 
						||
 | 
						||
  Also, when LANG=C is sometimes used, the hack can result in duplicates with
 | 
						||
  different representations of the same filename, like this:
 | 
						||
 | 
						||
	INSERT INTO associated VALUES(4,'SHA256E-s30--7d51d2454391a40e952bea478e45d64cf0d606e1e8c0652bb815a22e0e23419a,'foo.ü');
 | 
						||
	INSERT INTO associated VALUES(5,'SHA256E-s30--7d51d2454391a40e952bea478e45d64cf0d606e1e8c0652bb815a22e0e23419a','"foo.\56515\56508"');
 | 
						||
 | 
						||
  See <http://git-annex.branchable.com/bugs/assistant_crashes_in_TransferScanner/>
 | 
						||
  for an example of how this can happen.
 | 
						||
 | 
						||
  And it seems likely that a query by filename would fail if the filename
 | 
						||
  was in the database but with a different encoding.
 | 
						||
 | 
						||
  Fixed by converting to blob.
 | 
						||
 | 
						||
* SKey and IKey could fail to round-trip as well, when a Key contains something
 | 
						||
  (eg, a filename extension) that is not valid in the current locale,
 | 
						||
  for similar reasons to SFilePath. Using BLOB would be better.
 | 
						||
 | 
						||
  See [[!commit cf260d9a159050e2a7e70394fdd8db289c805ec3]] for details
 | 
						||
  about the encoding problem for SFilePath. I reproduced a similar problem
 | 
						||
  for IKey by making a file `foo.ü` and running `git add` on  it in a unicode
 | 
						||
  locale. Then with LANG=C, `git annex drop --force foo.ü` thinks
 | 
						||
  it drops the content, but in fact the work tree file is left containing
 | 
						||
  the dropped content. The database then contained:
 | 
						||
 | 
						||
	INSERT INTO associated VALUES(8,'SHA256E-s30--59594eea8d6f64156b3ce6530cc3a661739abf2a0b72443de8683c34b0b19344.ü','foo.ü');
 | 
						||
	INSERT INTO associated VALUES(9,'SHA256E-s30--59594eea8d6f64156b3ce6530cc3a661739abf2a0b72443de8683c34b0b19344.<2E><>','"foo.\56515\56508"');
 | 
						||
 | 
						||
  Fixed by converting to blob.
 | 
						||
 | 
						||
* migration
 | 
						||
 | 
						||
> Investigated this in more detail, and I can't find a way to
 | 
						||
> solve the encoding problem other than changing the encoding
 | 
						||
> SKey, IKey, and SFilePath in a non-backwards-compatible way.
 | 
						||
> 
 | 
						||
> Probably the encoding problem is actually not in sqlite, but
 | 
						||
> in persistent's use of Text internally. I did some tests with sqlite3
 | 
						||
> command and it did not seem to vary query results based on the locale
 | 
						||
> when using VARCHAR values. I was able to successfully insert an 
 | 
						||
> invalid unicode `ff` byte into it, and get the same byte back out.
 | 
						||
> 
 | 
						||
> Unfortunately, it's not possible to make persistent not use Text
 | 
						||
> for VARCHAR. While its PersistDbSpecific lets a non-Text value be stored
 | 
						||
> as VARCHAR, any VARCHAR value coming out of the database gets converted
 | 
						||
> to a PersistText.
 | 
						||
> 
 | 
						||
> So that seems to leave using a BLOB to store a ByteString for 
 | 
						||
> SKey, IKey, and SFilePath. But old git-annex won't be able to
 | 
						||
> read the updated databases, and won't know that it can't read them!
 | 
						||
> 
 | 
						||
> This seems to call for a flag day, throwing out the old database
 | 
						||
> contents and regenerating them from other data:
 | 
						||
> 
 | 
						||
> * Fsck (SKey)
 | 
						||
>   can't rebuild? Just drop and let incremental fscks re-do work
 | 
						||
>
 | 
						||
>   (done)
 | 
						||
 | 
						||
> * ContentIdentifier (IKey)  
 | 
						||
>   rebuild with updateFromLog, will need to diff from empty tree to
 | 
						||
>   current git-annex branch, may be expensive to do!
 | 
						||
 | 
						||
    (done; will be done automatically by the first command that needs to
 | 
						||
    use the db)
 | 
						||
> 
 | 
						||
> * Export (IKey, SFilePath)  
 | 
						||
>   difficult to rebuild, what if in the middle of an interrupted
 | 
						||
>   export?
 | 
						||
>   
 | 
						||
>   updateExportTreeFromLog only updates two tables (ExportTree and
 | 
						||
>   ExportTreeCurrent), not others (Exported and ExportedDirectory).
 | 
						||
>   
 | 
						||
>   Conceptually, this is the same as the repo being lost and another
 | 
						||
>   clone being used to update the export. The clone can only learn
 | 
						||
>   export state from the log. It's supposed to recover from such
 | 
						||
>   situations, the next time an export is run, so should be ok.
 | 
						||
>   
 | 
						||
>   An interrupted export won't resume where it left off, since the
 | 
						||
>   information about a partial export doesn't reach the git-annex branch.
 | 
						||
>   So it will re-send some files on resume. Documenting this should
 | 
						||
>   be good enough.
 | 
						||
>
 | 
						||
>   Testing this an exporting to a directory with both exporttree=yes and
 | 
						||
>   importtree=yes, it refused to let an interrupted export proceed after
 | 
						||
>   upgrade, with "unsafe to overwrite file". An import resolved the
 | 
						||
>   problem. Guess the problem is that the content idenfifier did not
 | 
						||
>   get recorded in the git-annex branch and the db value was lost on upgrade.
 | 
						||
> 
 | 
						||
>   If the export is not interrupted before upgrade, later exports are able
 | 
						||
>   to overwrite the exported files as they should.
 | 
						||
>
 | 
						||
>   Hmm, testing again, with a script, and interrupting the export, I am
 | 
						||
>   not able to reproduce the problem either. The cids of exported files
 | 
						||
>   get written to the journal, so make their way into the git-annex branch
 | 
						||
>   and so the cid db gets repopulated. Perhaps my earlier manual test
 | 
						||
>   was mistaken somehow.
 | 
						||
> 
 | 
						||
> * Keys (IKey, SFilePath, SInodeCache)
 | 
						||
>   Use scanUnlockedFiles to repopulate the Associated table.
 | 
						||
> 
 | 
						||
>   But that does not repopulate the Content table. Doing so needs
 | 
						||
>   to iterate over the unlocked files, filter out any that are modified,
 | 
						||
>   and record the InodeCaches of the unmodified ones. Seems that it would
 | 
						||
>   have to use git's index to know which files are modified.
 | 
						||
>   
 | 
						||
>   There is a race; a file could be modified after getting the list of
 | 
						||
>   modified files. To completely avoid that race is tricky. To mostly
 | 
						||
>   eliminate it, just generate the InodeCache, then check
 | 
						||
>   if the file is still unmodified, then check if the InodeCache is still
 | 
						||
>   valid. That leaves some much less likely races where files are being
 | 
						||
>   repeatedly swapped and the InodeCache generations see one file while
 | 
						||
>   the git ls-files --modified see the other one.
 | 
						||
>
 | 
						||
>   To fully avoid the race, use git ls-files --cached --debug,
 | 
						||
>   and parse the debug output into a InodeCache! This way the info
 | 
						||
>   from git's index is simply copied over into the git-annex database.
 | 
						||
>   One little problem: The --debug format is not specified and may change.
 | 
						||
>   However, it has never actually changed since it was introduced in 2010
 | 
						||
>   (git v1.8.3.1), except for a fix for an unsigned int overflow bug that
 | 
						||
>   was fixed in April 2019.
 | 
						||
>
 | 
						||
>   (done)
 | 
						||
> 
 | 
						||
> Alternatively, can keep the old database code and use it to read the old
 | 
						||
> databases during the migration. But then bad data that got in due to the
 | 
						||
> encoding problems will persist.
 | 
						||
 | 
						||
[[done]] --[[Joey]]
 |