It's really persistent causing the problem, and BLOB really seems the only way around it.
		
			
				
	
	
		
			203 lines
		
	
	
	
		
			7.9 KiB
			
		
	
	
	
		
			Markdown
		
	
	
	
	
	
			
		
		
	
	
			203 lines
		
	
	
	
		
			7.9 KiB
			
		
	
	
	
		
			Markdown
		
	
	
	
	
	
Collection of non-ideal things about git-annex's use of sqlite databases.
 | 
						||
Would be good to improve these sometime, but it would need a migration
 | 
						||
process.
 | 
						||
 | 
						||
* Database.Export.getExportedKey would be faster if there was an index
 | 
						||
  in the database, eg "ExportedIndex file key". This only affects
 | 
						||
  the speed of `git annex export`, which is probably swamped by the actual
 | 
						||
  upload of the data to the remote.
 | 
						||
 | 
						||
* There may be other selects elsewhere that are not indexed.
 | 
						||
 | 
						||
* Database.Types has some suboptimal encodings for Key and InodeCache.
 | 
						||
  They are both slow due to being implemented using String
 | 
						||
  (which may be fixable w/o changing the DB schema),
 | 
						||
  and the VARCHARs they generate are longer than necessary
 | 
						||
  since they look like eg `SKey "whatever"` and `I "whatever"`
 | 
						||
 | 
						||
* SFilePath is stored efficiently, and has to be a String anyway,
 | 
						||
  (until ByteStringFilePath is used)
 | 
						||
  but since it's stored as a VARCHAR, which sqlite interprets using the
 | 
						||
  current locale, there can be encoding problems. This is at least worked
 | 
						||
  around with a hack that escapes FilePaths that contain unusual
 | 
						||
  characters. It would be much better to use a BLOB.
 | 
						||
 | 
						||
  Also, when LANG=C is sometimes used, the hack can result in duplicates with
 | 
						||
  different representations of the same filename, like this:
 | 
						||
 | 
						||
	INSERT INTO associated VALUES(4,'SHA256E-s30--7d51d2454391a40e952bea478e45d64cf0d606e1e8c0652bb815a22e0e23419a,'foo.ü');
 | 
						||
	INSERT INTO associated VALUES(5,'SHA256E-s30--7d51d2454391a40e952bea478e45d64cf0d606e1e8c0652bb815a22e0e23419a','"foo.\56515\56508"');
 | 
						||
 | 
						||
  See <http://git-annex.branchable.com/bugs/assistant_crashes_in_TransferScanner/>
 | 
						||
  for an example of how this can happen.
 | 
						||
 | 
						||
  And it seems likely that a query by filename would fail if the filename
 | 
						||
  was in the database but with a different encoding.
 | 
						||
 | 
						||
* IKey could fail to round-trip as well, when a Key contains something
 | 
						||
  (eg, a filename extension) that is not valid in the current locale,
 | 
						||
  for similar reasons to SFilePath. Using BLOB would be better.
 | 
						||
 | 
						||
  See [[!commit cf260d9a159050e2a7e70394fdd8db289c805ec3]] for details
 | 
						||
  about the encoding problem for SFilePath. I reproduced a similar problem
 | 
						||
  for IKey by making a file `foo.ü` and running `git add` on  it in a unicode
 | 
						||
  locale. Then with LANG=C, `git annex drop --force foo.ü` thinks
 | 
						||
  it drops the content, but in fact the work tree file is left containing
 | 
						||
  the dropped content. The database then contained:
 | 
						||
 | 
						||
	INSERT INTO associated VALUES(8,'SHA256E-s30--59594eea8d6f64156b3ce6530cc3a661739abf2a0b72443de8683c34b0b19344.ü','foo.ü');
 | 
						||
	INSERT INTO associated VALUES(9,'SHA256E-s30--59594eea8d6f64156b3ce6530cc3a661739abf2a0b72443de8683c34b0b19344.<2E><>','"foo.\56515\56508"');
 | 
						||
 | 
						||
> Investigated this in more detail, and I can't find a way to
 | 
						||
> solve the encoding problem other than changing the encoding
 | 
						||
> SKey, IKey, and SFilePath in a non-backwards-compatible way.
 | 
						||
> 
 | 
						||
> Probably the encoding problem is actually not in sqlite, but
 | 
						||
> in persistent's use of Text internally. I did some tests with sqlite3
 | 
						||
> command and it did not seem to vary query results based on the locale
 | 
						||
> when using VARCHAR values. I was able to successfully insert an 
 | 
						||
> invalid unicode `ff` byte into it, and get the same byte back out.
 | 
						||
> 
 | 
						||
> Unfortunately, it's not possible to make persistent not use Text
 | 
						||
> for VARCHAR. While its PersistDbSpecific lets a non-Text value be stored
 | 
						||
> as VARCHAR, any VARCHAR value coming out of the database gets converted
 | 
						||
> to a PersistText.
 | 
						||
> 
 | 
						||
> So that seems to leave using a BLOB to store a ByteString for 
 | 
						||
> SKey, IKey, and SFilePath. Attached patch shows how to do that,
 | 
						||
> but old git-annex won't be able to read the updated databases,
 | 
						||
> and won't know that it can't read them!
 | 
						||
> 
 | 
						||
> This seems to call for a flag day, throwing out the old database
 | 
						||
> contents and regenerating them from other data:
 | 
						||
> 
 | 
						||
> * Fsck (SKey)
 | 
						||
>   can't rebuild? Just drop and let incremental fscks re-do work
 | 
						||
> * ContentIdentifier (IKey)  
 | 
						||
>   rebuild with updateFromLog, would need to diff from empty tree to
 | 
						||
>   current git-annex branch, may be expensive to do!
 | 
						||
> * Export (IKey, SFilePath)  
 | 
						||
>   difficult to rebuild, what if in the middle of an interrupted
 | 
						||
>   export?
 | 
						||
>   
 | 
						||
>   updateExportTreeFromLog only updates two tables, not others
 | 
						||
>   
 | 
						||
>   Conceptually, this is the same as the repo being lost and another
 | 
						||
>   clone being used to update the export. The clone can only learn
 | 
						||
>   export state from the log. It's supposed to recover from such
 | 
						||
>   situations, the next time an export is run, so should be ok.
 | 
						||
>   But it might result in already exported files being re-uploaded,
 | 
						||
>   or other unncessary work.
 | 
						||
> Keys (IKey, SFilePath)
 | 
						||
>   rebuild with scanUnlockedFiles
 | 
						||
> 
 | 
						||
>   does that update the Content table with the InodeCache?
 | 
						||
>
 | 
						||
> But after such a transition, how to communicate to the old git-annex
 | 
						||
> that it can't use the databases any longer? Moving the databases
 | 
						||
> out of the way won't do; old git-annex will just recreate them and 
 | 
						||
> start with missing data!
 | 
						||
> 
 | 
						||
> And, what about users who use a mix of old and new git-annex versions?
 | 
						||
> 
 | 
						||
> Seems this needs an annex.version bump from v7 to v8.
 | 
						||
 | 
						||
----
 | 
						||
 | 
						||
[[!format patch """
 | 
						||
diff --git a/Database/Types.hs b/Database/Types.hs
 | 
						||
index f08cf4e9d..3e9c9e267 100644
 | 
						||
--- a/Database/Types.hs
 | 
						||
+++ b/Database/Types.hs
 | 
						||
@@ -14,11 +14,12 @@ import Database.Persist.TH
 | 
						||
 import Database.Persist.Class hiding (Key)
 | 
						||
 import Database.Persist.Sql hiding (Key)
 | 
						||
 import Data.Maybe
 | 
						||
-import Data.Char
 | 
						||
 import qualified Data.ByteString as S
 | 
						||
+import qualified Data.ByteString.Lazy as L
 | 
						||
 import qualified Data.Text as T
 | 
						||
 
 | 
						||
 import Utility.PartialPrelude
 | 
						||
+import Utility.FileSystemEncoding
 | 
						||
 import Key
 | 
						||
 import Utility.InodeCache
 | 
						||
 import Git.Types (Ref(..))
 | 
						||
@@ -37,23 +38,18 @@ fromSKey (SKey s) = fromMaybe (error $ "bad serialized Key " ++ s) (deserializeK
 | 
						||
 
 | 
						||
 derivePersistField "SKey"
 | 
						||
 
 | 
						||
--- A Key index. More efficient than SKey, but its Read instance does not
 | 
						||
--- work when it's used in any kind of complex data structure.
 | 
						||
-newtype IKey = IKey String
 | 
						||
-
 | 
						||
-instance Read IKey where
 | 
						||
-	readsPrec _ s = [(IKey s, "")]
 | 
						||
-
 | 
						||
-instance Show IKey where
 | 
						||
-	show (IKey s) = s
 | 
						||
+-- A Key index. More efficient than SKey.
 | 
						||
+newtype IKey = IKey S.ByteString
 | 
						||
+	deriving (Eq, Show, PersistField, PersistFieldSql)
 | 
						||
 
 | 
						||
+-- FIXME: toStrict copies, not efficient
 | 
						||
 toIKey :: Key -> IKey
 | 
						||
-toIKey = IKey . serializeKey
 | 
						||
+toIKey = IKey . L.toStrict . serializeKey'
 | 
						||
 
 | 
						||
 fromIKey :: IKey -> Key
 | 
						||
-fromIKey (IKey s) = fromMaybe (error $ "bad serialized Key " ++ s) (deserializeKey s)
 | 
						||
-
 | 
						||
-derivePersistField "IKey"
 | 
						||
+fromIKey (IKey b) = fromMaybe
 | 
						||
+	(error $ "bad serialized Key " ++ show b) 
 | 
						||
+	(deserializeKey' b)
 | 
						||
 
 | 
						||
 -- A serialized InodeCache
 | 
						||
 newtype SInodeCache = I String
 | 
						||
@@ -67,39 +63,15 @@ fromSInodeCache (I s) = fromMaybe (error $ "bad serialized InodeCache " ++ s) (r
 | 
						||
 
 | 
						||
 derivePersistField "SInodeCache"
 | 
						||
 
 | 
						||
--- A serialized FilePath.
 | 
						||
---
 | 
						||
--- Not all unicode characters round-trip through sqlite. In particular,
 | 
						||
--- surrigate code points do not. So, escape the FilePath. But, only when
 | 
						||
--- it contains such characters.
 | 
						||
-newtype SFilePath = SFilePath String
 | 
						||
-
 | 
						||
--- Note that Read instance does not work when used in any kind of complex
 | 
						||
--- data structure.
 | 
						||
-instance Read SFilePath where
 | 
						||
-	readsPrec _ s = [(SFilePath s, "")]
 | 
						||
-
 | 
						||
-instance Show SFilePath where
 | 
						||
-	show (SFilePath s) = s
 | 
						||
+-- A serialized FilePath. Stored as a ByteString to avoid encoding problems.
 | 
						||
+newtype SFilePath = SFilePath S.ByteString
 | 
						||
+	deriving (Eq, Show, PersistField, PersistFieldSql)
 | 
						||
 
 | 
						||
 toSFilePath :: FilePath -> SFilePath
 | 
						||
-toSFilePath s@('"':_) = SFilePath (show s)
 | 
						||
-toSFilePath s
 | 
						||
-	| any needsescape s = SFilePath (show s)
 | 
						||
-	| otherwise = SFilePath s
 | 
						||
-  where
 | 
						||
-	needsescape c = case generalCategory c of
 | 
						||
-		Surrogate -> True
 | 
						||
-		PrivateUse -> True
 | 
						||
-		NotAssigned -> True
 | 
						||
-		_ -> False
 | 
						||
+toSFilePath = SFilePath . encodeBS
 | 
						||
 
 | 
						||
 fromSFilePath :: SFilePath -> FilePath
 | 
						||
-fromSFilePath (SFilePath s@('"':_)) =
 | 
						||
-	fromMaybe (error "bad serialized SFilePath " ++ s) (readish s)
 | 
						||
-fromSFilePath (SFilePath s) = s
 | 
						||
-
 | 
						||
-derivePersistField "SFilePath"
 | 
						||
+fromSFilePath (SFilePath b) = decodeBS b
 | 
						||
 
 | 
						||
 -- A serialized Ref
 | 
						||
 newtype SRef = SRef Ref
 | 
						||
"""]]
 |