Collection of non-ideal things about git-annex's use of sqlite databases. Would be good to improve these sometime, but it would need a migration process. * Database.Export.getExportedKey would be faster if there was an index in the database, eg "ExportedIndex file key". This only affects the speed of `git annex export`, which is probably swamped by the actual upload of the data to the remote. * There may be other selects elsewhere that are not indexed. * Database.Types has some suboptimal encodings for Key and InodeCache. They are both slow due to being implemented using String (which may be fixable w/o changing the DB schema), and the VARCHARs they generate are longer than necessary since they look like eg `SKey "whatever"` and `I "whatever"` * SFilePath is stored efficiently, and has to be a String anyway, (until ByteStringFilePath is used) but since it's stored as a VARCHAR, which sqlite interprets using the current locale, there can be encoding problems. This is at least worked around with a hack that escapes FilePaths that contain unusual characters. It would be much better to use a BLOB. Also, when LANG=C is sometimes used, the hack can result in duplicates with different representations of the same filename, like this: INSERT INTO associated VALUES(4,'SHA256E-s30--7d51d2454391a40e952bea478e45d64cf0d606e1e8c0652bb815a22e0e23419a,'foo.ü'); INSERT INTO associated VALUES(5,'SHA256E-s30--7d51d2454391a40e952bea478e45d64cf0d606e1e8c0652bb815a22e0e23419a','"foo.\56515\56508"'); See for an example of how this can happen. And it seems likely that a query by filename would fail if the filename was in the database but with a different encoding. * IKey could fail to round-trip as well, when a Key contains something (eg, a filename extension) that is not valid in the current locale, for similar reasons to SFilePath. Using BLOB would be better. See [[!commit cf260d9a159050e2a7e70394fdd8db289c805ec3]] for details about the encoding problem for SFilePath. I reproduced a similar problem for IKey by making a file `foo.ü` and running `git add` on it in a unicode locale. Then with LANG=C, `git annex drop --force foo.ü` thinks it drops the content, but in fact the work tree file is left containing the dropped content. The database then contained: INSERT INTO associated VALUES(8,'SHA256E-s30--59594eea8d6f64156b3ce6530cc3a661739abf2a0b72443de8683c34b0b19344.ü','foo.ü'); INSERT INTO associated VALUES(9,'SHA256E-s30--59594eea8d6f64156b3ce6530cc3a661739abf2a0b72443de8683c34b0b19344.��','"foo.\56515\56508"'); > Investigated this in more detail, and I can't find a way to > solve the encoding problem other than changing the encoding > SKey, IKey, and SFilePath in a non-backwards-compatible way. > > Probably the encoding problem is actually not in sqlite, but > in persistent's use of Text internally. I did some tests with sqlite3 > command and it did not seem to vary query results based on the locale > when using VARCHAR values. I was able to successfully insert an > invalid unicode `ff` byte into it, and get the same byte back out. > > Unfortunately, it's not possible to make persistent not use Text > for VARCHAR. While its PersistDbSpecific lets a non-Text value be stored > as VARCHAR, any VARCHAR value coming out of the database gets converted > to a PersistText. > > So that seems to leave using a BLOB to store a ByteString for > SKey, IKey, and SFilePath. Attached patch shows how to do that, > but old git-annex won't be able to read the updated databases, > and won't know that it can't read them! > > This seems to call for a flag day, throwing out the old database > contents and regenerating them from other data: > > * Fsck (SKey) > can't rebuild? Just drop and let incremental fscks re-do work > * ContentIdentifier (IKey) > rebuild with updateFromLog, would need to diff from empty tree to > current git-annex branch, may be expensive to do! > * Export (IKey, SFilePath) > difficult to rebuild, what if in the middle of an interrupted > export? > > updateExportTreeFromLog only updates two tables, not others > > Conceptually, this is the same as the repo being lost and another > clone being used to update the export. The clone can only learn > export state from the log. It's supposed to recover from such > situations, the next time an export is run, so should be ok. > But it might result in already exported files being re-uploaded, > or other unncessary work. > Keys (IKey, SFilePath) > rebuild with scanUnlockedFiles > > does that update the Content table with the InodeCache? > > But after such a transition, how to communicate to the old git-annex > that it can't use the databases any longer? Moving the databases > out of the way won't do; old git-annex will just recreate them and > start with missing data! > > And, what about users who use a mix of old and new git-annex versions? > > Seems this needs an annex.version bump from v7 to v8. ---- [[!format patch """ diff --git a/Database/Types.hs b/Database/Types.hs index f08cf4e9d..3e9c9e267 100644 --- a/Database/Types.hs +++ b/Database/Types.hs @@ -14,11 +14,12 @@ import Database.Persist.TH import Database.Persist.Class hiding (Key) import Database.Persist.Sql hiding (Key) import Data.Maybe -import Data.Char import qualified Data.ByteString as S +import qualified Data.ByteString.Lazy as L import qualified Data.Text as T import Utility.PartialPrelude +import Utility.FileSystemEncoding import Key import Utility.InodeCache import Git.Types (Ref(..)) @@ -37,23 +38,18 @@ fromSKey (SKey s) = fromMaybe (error $ "bad serialized Key " ++ s) (deserializeK derivePersistField "SKey" --- A Key index. More efficient than SKey, but its Read instance does not --- work when it's used in any kind of complex data structure. -newtype IKey = IKey String - -instance Read IKey where - readsPrec _ s = [(IKey s, "")] - -instance Show IKey where - show (IKey s) = s +-- A Key index. More efficient than SKey. +newtype IKey = IKey S.ByteString + deriving (Eq, Show, PersistField, PersistFieldSql) +-- FIXME: toStrict copies, not efficient toIKey :: Key -> IKey -toIKey = IKey . serializeKey +toIKey = IKey . L.toStrict . serializeKey' fromIKey :: IKey -> Key -fromIKey (IKey s) = fromMaybe (error $ "bad serialized Key " ++ s) (deserializeKey s) - -derivePersistField "IKey" +fromIKey (IKey b) = fromMaybe + (error $ "bad serialized Key " ++ show b) + (deserializeKey' b) -- A serialized InodeCache newtype SInodeCache = I String @@ -67,39 +63,15 @@ fromSInodeCache (I s) = fromMaybe (error $ "bad serialized InodeCache " ++ s) (r derivePersistField "SInodeCache" --- A serialized FilePath. --- --- Not all unicode characters round-trip through sqlite. In particular, --- surrigate code points do not. So, escape the FilePath. But, only when --- it contains such characters. -newtype SFilePath = SFilePath String - --- Note that Read instance does not work when used in any kind of complex --- data structure. -instance Read SFilePath where - readsPrec _ s = [(SFilePath s, "")] - -instance Show SFilePath where - show (SFilePath s) = s +-- A serialized FilePath. Stored as a ByteString to avoid encoding problems. +newtype SFilePath = SFilePath S.ByteString + deriving (Eq, Show, PersistField, PersistFieldSql) toSFilePath :: FilePath -> SFilePath -toSFilePath s@('"':_) = SFilePath (show s) -toSFilePath s - | any needsescape s = SFilePath (show s) - | otherwise = SFilePath s - where - needsescape c = case generalCategory c of - Surrogate -> True - PrivateUse -> True - NotAssigned -> True - _ -> False +toSFilePath = SFilePath . encodeBS fromSFilePath :: SFilePath -> FilePath -fromSFilePath (SFilePath s@('"':_)) = - fromMaybe (error "bad serialized SFilePath " ++ s) (readish s) -fromSFilePath (SFilePath s) = s - -derivePersistField "SFilePath" +fromSFilePath (SFilePath b) = decodeBS b -- A serialized Ref newtype SRef = SRef Ref """]]