use importChanges optimisation

Large speed up to importing trees from special remotes that contain a lot
of files, by only processing changed files.

Benchmarks:

Importing from a special remote that has 10000 files, that have all been
imported before, and 1 new file sped up from 26.06 to 2.59 seconds.

An import with no change and 10000 unchanged files sped up from 24.3 to
1.99 seconds.

Going up to 20000 files, an import with no changes sped up from
125.95 to 3.84 seconds.

Sponsored-by: k0ld on Patreon
This commit is contained in:
Joey Hess 2023-06-01 13:46:16 -04:00
parent 029b08f54b
commit 40017089f2
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
5 changed files with 174 additions and 86 deletions

View file

@ -15,6 +15,7 @@ module Annex.Import (
recordImportTree, recordImportTree,
canImportKeys, canImportKeys,
ImportResult(..), ImportResult(..),
Imported,
importChanges, importChanges,
importKeys, importKeys,
makeImportMatcher, makeImportMatcher,
@ -41,6 +42,7 @@ import Annex.RemoteTrackingBranch
import Annex.HashObject import Annex.HashObject
import Annex.Transfer import Annex.Transfer
import Annex.CheckIgnore import Annex.CheckIgnore
import Annex.CatFile
import Annex.VectorClock import Annex.VectorClock
import Command import Command
import Backend import Backend
@ -104,9 +106,9 @@ buildImportCommit
:: Remote :: Remote
-> ImportTreeConfig -> ImportTreeConfig
-> ImportCommitConfig -> ImportCommitConfig
-> ImportableContentsChunkable Annex (Either Sha Key) -> Imported
-> Annex (Maybe Ref) -> Annex (Maybe Ref)
buildImportCommit remote importtreeconfig importcommitconfig importable = buildImportCommit remote importtreeconfig importcommitconfig imported =
case importCommitTracking importcommitconfig of case importCommitTracking importcommitconfig of
Nothing -> go Nothing Nothing -> go Nothing
Just trackingcommit -> inRepo (Git.Ref.tree trackingcommit) >>= \case Just trackingcommit -> inRepo (Git.Ref.tree trackingcommit) >>= \case
@ -114,8 +116,8 @@ buildImportCommit remote importtreeconfig importcommitconfig importable =
Just _ -> go (Just trackingcommit) Just _ -> go (Just trackingcommit)
where where
go trackingcommit = do go trackingcommit = do
(imported, updatestate) <- recordImportTree remote importtreeconfig importable (importedtree, updatestate) <- recordImportTree remote importtreeconfig imported
buildImportCommit' remote importcommitconfig trackingcommit imported >>= \case buildImportCommit' remote importcommitconfig trackingcommit importedtree >>= \case
Just finalcommit -> do Just finalcommit -> do
updatestate updatestate
return (Just finalcommit) return (Just finalcommit)
@ -129,11 +131,11 @@ buildImportCommit remote importtreeconfig importcommitconfig importable =
recordImportTree recordImportTree
:: Remote :: Remote
-> ImportTreeConfig -> ImportTreeConfig
-> ImportableContentsChunkable Annex (Either Sha Key) -> Imported
-> Annex (History Sha, Annex ()) -> Annex (History Sha, Annex ())
recordImportTree remote importtreeconfig importable = do recordImportTree remote importtreeconfig imported = do
imported@(History finaltree _) <- buildImportTrees basetree subdir importable importedtree@(History finaltree _) <- buildImportTrees basetree subdir imported
return (imported, updatestate finaltree) return (importedtree, updatestate finaltree)
where where
basetree = case importtreeconfig of basetree = case importtreeconfig of
ImportTree -> emptyTree ImportTree -> emptyTree
@ -265,34 +267,69 @@ buildImportCommit' remote importcommitconfig mtrackingcommit imported@(History t
parents <- mapM (mknewcommits oldhc old) (S.toList hs) parents <- mapM (mknewcommits oldhc old) (S.toList hs)
mkcommit parents importedtree mkcommit parents importedtree
{- Builds a history of git trees reflecting the ImportableContents. {- Builds a history of git trees for an import.
- -
- When a subdir is provided, imported tree is grafted into the basetree at - When a subdir is provided, the imported tree is grafted into
- that location, replacing any object that was there. - the basetree at that location, replacing any object that was there.
-} -}
buildImportTrees buildImportTrees
:: Ref :: Ref
-> Maybe TopFilePath -> Maybe TopFilePath
-> ImportableContentsChunkable Annex (Either Sha Key) -> Imported
-> Annex (History Sha) -> Annex (History Sha)
buildImportTrees = buildImportTreesGeneric convertImportTree buildImportTrees basetree msubdir (ImportedFull imported) =
buildImportTreesGeneric convertImportTree basetree msubdir imported
buildImportTrees basetree msubdir (ImportedDiff (LastImportedTree oldtree) imported) = do
importtree <- if null (importableContents imported)
then pure oldtree
else applydiff
repo <- Annex.gitRepo
t <- withMkTreeHandle repo $
graftImportTree basetree msubdir importtree
-- Diffing is not currently implemented when the history is not empty.
return (History t mempty)
where
applydiff = do
let (removed, new) = partition isremoved
(importableContents imported)
newtreeitems <- catMaybes <$> mapM mktreeitem new
let removedfiles = map (mkloc . fst) removed
inRepo $ adjustTree
(pure . Just)
-- ^ keep files that are not added/removed the same
newtreeitems
(\_oldti newti -> newti)
-- ^ prefer newly added version of file
removedfiles
oldtree
mktreeitem (loc, DiffChanged v) =
Just <$> mkImportTreeItem msubdir loc v
mktreeitem (_, DiffRemoved) =
pure Nothing
mkloc = asTopFilePath . fromImportLocation
isremoved (_, v) = v == DiffRemoved
convertImportTree :: Maybe TopFilePath -> [(ImportLocation, Either Sha Key)] -> Annex Tree convertImportTree :: Maybe TopFilePath -> [(ImportLocation, Either Sha Key)] -> Annex Tree
convertImportTree msubdir ls = treeItemsToTree <$> mapM mktreeitem ls convertImportTree msubdir ls =
treeItemsToTree <$> mapM (uncurry $ mkImportTreeItem msubdir) ls
mkImportTreeItem :: Maybe TopFilePath -> ImportLocation -> Either Sha Key -> Annex TreeItem
mkImportTreeItem msubdir loc v = case v of
Right k -> do
relf <- fromRepo $ fromTopFilePath topf
symlink <- calcRepo $ gitAnnexLink relf k
linksha <- hashSymlink symlink
return $ TreeItem treepath (fromTreeItemType TreeSymlink) linksha
Left sha ->
return $ TreeItem treepath (fromTreeItemType TreeFile) sha
where where
mktreeitem (loc, v) = case v of lf = fromImportLocation loc
Right k -> do treepath = asTopFilePath lf
relf <- fromRepo $ fromTopFilePath topf topf = asTopFilePath $
symlink <- calcRepo $ gitAnnexLink relf k maybe lf (\sd -> getTopFilePath sd P.</> lf) msubdir
linksha <- hashSymlink symlink
return $ TreeItem treepath (fromTreeItemType TreeSymlink) linksha
Left sha ->
return $ TreeItem treepath (fromTreeItemType TreeFile) sha
where
lf = fromImportLocation loc
treepath = asTopFilePath lf
topf = asTopFilePath $
maybe lf (\sd -> getTopFilePath sd P.</> lf) msubdir
{- Builds a history of git trees using ContentIdentifiers. {- Builds a history of git trees using ContentIdentifiers.
- -
@ -426,11 +463,24 @@ canImportKeys remote importcontent =
where where
ia = Remote.importActions remote ia = Remote.importActions remote
-- Result of an import. ImportUnfinished indicates that some file failed to
-- be imported. Running again should resume where it left off.
data ImportResult t
= ImportFinished t
| ImportUnfinished
data Diffed t data Diffed t
= DiffChanged t = DiffChanged t
| DiffRemoved | DiffRemoved
deriving (Eq)
{- Diffs between the current and previous ContentIdentifier trees, and data Imported
= ImportedFull (ImportableContentsChunkable Annex (Either Sha Key))
| ImportedDiff LastImportedTree (ImportableContents (Diffed (Either Sha Key)))
newtype LastImportedTree = LastImportedTree Sha
{- Diffs between the previous and current ContentIdentifier trees, and
- runs importKeys on only the changed files. - runs importKeys on only the changed files.
- -
- This will download the same content as if importKeys were run on all - This will download the same content as if importKeys were run on all
@ -447,9 +497,7 @@ importChanges
-> Bool -> Bool
-> Bool -> Bool
-> ImportableContentsChunkable Annex (ContentIdentifier, ByteSize) -> ImportableContentsChunkable Annex (ContentIdentifier, ByteSize)
-> Annex (ImportResult (Either -> Annex (ImportResult Imported)
(ImportableContentsChunkable Annex (Either Sha Key))
(ImportableContentsChunkable Annex (Diffed (Either Sha Key)))))
importChanges remote importtreeconfig importcontent thirdpartypopulated importablecontents = do importChanges remote importtreeconfig importcontent thirdpartypopulated importablecontents = do
((History currcidtree currhistory), cidtreemap) <- buildContentIdentifierTree importablecontents ((History currcidtree currhistory), cidtreemap) <- buildContentIdentifierTree importablecontents
-- diffimport below does not handle history, so when there is -- diffimport below does not handle history, so when there is
@ -459,56 +507,90 @@ importChanges remote importtreeconfig importcontent thirdpartypopulated importab
else do else do
getContentIdentifierTree (Remote.uuid remote) >>= \case getContentIdentifierTree (Remote.uuid remote) >>= \case
Nothing -> fullimport currcidtree Nothing -> fullimport currcidtree
Just prevcidtree -> diffimport cidtreemap prevcidtree currcidtree Just prevcidtree -> candiffimport prevcidtree >>= \case
Nothing -> fullimport currcidtree
Just lastimportedtree -> diffimport cidtreemap prevcidtree currcidtree lastimportedtree
where where
remember = recordContentIdentifierTree (Remote.uuid remote) remember = recordContentIdentifierTree (Remote.uuid remote)
-- In order to use a diff, the previous ContentIdentifier tree must
-- not have been garbage collected. Which can happen since there
-- are no git refs to it.
--
-- Also, a tree must have been imported before, and that tree must
-- also have not been garbage collected (which is less likely to
-- happen due to the remote tracking branch).
candiffimport prevcidtree =
catObjectMetaData prevcidtree >>= \case
Nothing -> return Nothing
Just _ -> getLastImportedTree remote >>= \case
Nothing -> return Nothing
Just lastimported@(LastImportedTree t) ->
ifM (isJust <$> catObjectMetaData t)
( return (Just lastimported)
, return Nothing
)
fullimport currcidtree = fullimport currcidtree =
importKeys remote importtreeconfig importcontent thirdpartypopulated importablecontents >>= \case importKeys remote importtreeconfig importcontent thirdpartypopulated importablecontents >>= \case
ImportUnfinished -> return ImportUnfinished ImportUnfinished -> return ImportUnfinished
ImportFinished r -> do ImportFinished r -> do
remember currcidtree remember currcidtree
return $ ImportFinished $ Left r return $ ImportFinished $ ImportedFull r
diffimport cidtreemap prevcidtree currcidtree = do diffimport cidtreemap prevcidtree currcidtree lastimportedtree = do
(diff, cleanup) <- inRepo $ Git.DiffTree.diffTreeRecursive currcidtree prevcidtree (diff, cleanup) <- inRepo $ Git.DiffTree.diffTreeRecursive
let (removed, changed) = partition (\ti -> Git.DiffTree.dstsha ti `elem` nullShas) diff prevcidtree
let mkloc = mkImportLocation . getTopFilePath . Git.DiffTree.file currcidtree
let (removed, changed) = partition isremoval diff
let mkicchanged ti = do let mkicchanged ti = do
v <- M.lookup (Git.DiffTree.dstsha ti) cidtreemap v <- M.lookup (Git.DiffTree.dstsha ti) cidtreemap
return (mkloc ti, v) return (mkloc ti, v)
let ic = ImportableContentsComplete $ ImportableContents let ic = ImportableContentsComplete $ ImportableContents
{ importableContents = mapMaybe mkicchanged changed { importableContents = mapMaybe mkicchanged changed
, importableHistory = [] , importableHistory = []
} }
importKeys remote importtreeconfig importcontent thirdpartypopulated ic >>= \case importKeys remote importtreeconfig importcontent thirdpartypopulated ic >>= \case
ImportUnfinished -> do ImportUnfinished -> do
void $ liftIO cleanup void $ liftIO cleanup
return ImportUnfinished return ImportUnfinished
ImportFinished (ImportableContentsComplete ic') -> liftIO cleanup >>= \case ImportFinished (ImportableContentsComplete ic') ->
False -> return ImportUnfinished liftIO cleanup >>= \case
True -> do False -> return ImportUnfinished
remember currcidtree True -> do
let diffchanged = map remember currcidtree
(\(loc, v) -> (loc, DiffChanged v)) return $ ImportFinished $
(importableContents ic') ImportedDiff lastimportedtree
let diffremoved = map (mkdiff ic' removed)
(\ti -> (mkloc ti, DiffRemoved))
removed
let ic'' = ImportableContentsComplete $ ImportableContents
{ importableContents = diffremoved ++ diffchanged
, importableHistory = []
}
return $ ImportFinished $ Right ic''
-- importKeys is not passed ImportableContentsChunked -- importKeys is not passed ImportableContentsChunked
-- above, so it cannot return it -- above, so it cannot return it
ImportFinished (ImportableContentsChunked {}) -> error "internal" ImportFinished (ImportableContentsChunked {}) -> error "internal"
isremoval ti = Git.DiffTree.dstsha ti `elem` nullShas
mkloc = mkImportLocation . getTopFilePath . Git.DiffTree.file
-- Result of an import. ImportUnfinished indicates that some file failed to mkdiff ic removed = ImportableContents
-- be imported. Running again should resume where it left off. { importableContents = diffremoved ++ diffchanged
data ImportResult t , importableHistory = []
= ImportFinished t }
| ImportUnfinished where
diffchanged = map
(\(loc, v) -> (loc, DiffChanged v))
(importableContents ic)
diffremoved = map
(\ti -> (mkloc ti, DiffRemoved))
removed
{- Gets the tree that was last imported from the remote
- (or exported to it if an export happened after the last import).
-}
getLastImportedTree :: Remote -> Annex (Maybe LastImportedTree)
getLastImportedTree remote = do
db <- Export.openDb (Remote.uuid remote)
mtree <- liftIO $ Export.getExportTreeCurrent db
Export.closeDb db
return (LastImportedTree <$> mtree)
{- Downloads all new ContentIdentifiers, or when importcontent is False, {- Downloads all new ContentIdentifiers, or when importcontent is False,
- generates Keys without downloading. - generates Keys without downloading.

View file

@ -75,6 +75,8 @@ git-annex (10.20230408) UNRELEASED; urgency=medium
youtube-dl. Using youtube-dl is now deprecated, and git-annex no longer youtube-dl. Using youtube-dl is now deprecated, and git-annex no longer
tries to parse its output to display download progress tries to parse its output to display download progress
* repair: Fix handling of git ref names on Windows. * repair: Fix handling of git ref names on Windows.
* Large speed up to importing trees from special remotes that contain a lot
of files, by only processing changed files.
* Speed up importing trees from special remotes somewhat by avoiding * Speed up importing trees from special remotes somewhat by avoiding
redundant writes to sqlite database. redundant writes to sqlite database.

View file

@ -335,7 +335,7 @@ seekRemote remote branch msubdir importcontent ci = do
void $ includeCommandAction (listContents remote importtreeconfig ci importabletvar) void $ includeCommandAction (listContents remote importtreeconfig ci importabletvar)
liftIO (atomically (readTVar importabletvar)) >>= \case liftIO (atomically (readTVar importabletvar)) >>= \case
Nothing -> return () Nothing -> return ()
Just importable -> importKeys remote importtreeconfig importcontent False importable >>= \case Just importable -> importChanges remote importtreeconfig importcontent False importable >>= \case
ImportUnfinished -> warning $ UnquotedString $ concat ImportUnfinished -> warning $ UnquotedString $ concat
[ "Failed to import some files from " [ "Failed to import some files from "
, Remote.name remote , Remote.name remote
@ -373,10 +373,10 @@ listContents' remote importtreeconfig ci a =
, err , err
] ]
commitRemote :: Remote -> Branch -> RemoteTrackingBranch -> Maybe Sha -> ImportTreeConfig -> ImportCommitConfig -> ImportableContentsChunkable Annex (Either Sha Key) -> CommandStart commitRemote :: Remote -> Branch -> RemoteTrackingBranch -> Maybe Sha -> ImportTreeConfig -> ImportCommitConfig -> Imported -> CommandStart
commitRemote remote branch tb trackingcommit importtreeconfig importcommitconfig importable = commitRemote remote branch tb trackingcommit importtreeconfig importcommitconfig imported =
starting "update" ai si $ do starting "update" ai si $ do
importcommit <- buildImportCommit remote importtreeconfig importcommitconfig importable importcommit <- buildImportCommit remote importtreeconfig importcommitconfig imported
next $ updateremotetrackingbranch importcommit next $ updateremotetrackingbranch importcommit
where where
ai = ActionItemOther (Just $ UnquotedString $ fromRef $ fromRemoteTrackingBranch tb) ai = ActionItemOther (Just $ UnquotedString $ fromRef $ fromRemoteTrackingBranch tb)

View file

@ -586,9 +586,9 @@ pullThirdPartyPopulated o remote
| otherwise = void $ includeCommandAction $ starting "list" ai si $ | otherwise = void $ includeCommandAction $ starting "list" ai si $
Command.Import.listContents' remote ImportTree (CheckGitIgnore False) go Command.Import.listContents' remote ImportTree (CheckGitIgnore False) go
where where
go (Just importable) = importKeys remote ImportTree False True importable >>= \case go (Just importable) = importChanges remote ImportTree False True importable >>= \case
ImportFinished importablekeys -> do ImportFinished imported -> do
(_imported, updatestate) <- recordImportTree remote ImportTree importablekeys (_t, updatestate) <- recordImportTree remote ImportTree imported
next $ do next $ do
updatestate updatestate
return True return True

View file

@ -10,13 +10,24 @@ own scalability limits with many files.)
Still, it would be good to find some ways to speed it up. Still, it would be good to find some ways to speed it up.
In particular, speeding up repeated imports from the same special remote,
when only a few files have changed, would make it much more useful. It's ok
to pay a somewhat expensive price to import a lot of new files, if updates
are quick after that.
--- ---
A major thing that makes it slow, when a remote contains
many files, is converting from ContentIdentifiers to Keys.
It does a cidsdb lookup for every file, before it knows if the file has
changed or not, which gets slow with a lot of files.
What if it generated a git tree, where each file in the tree is What if it generated a git tree, where each file in the tree is
a sha1 hash of the ContentIdentifier. The tree can just be recorded locally a sha1 hash of the ContentIdentifier. The tree can just be recorded locally
somewhere. It's ok if it gets garbage collected; it's only an optimisation. somewhere. It's ok if it gets garbage collected; it's only an optimisation.
On the next sync, diff from the old to the new tree. It only needs to On the next sync, diff from the old to the new tree. It only needs to
import the changed files! import the changed files, and can avoid the cidsdb lookup for the
unchanged files!
(That is assuming that ContentIdentifiers don't tend to sha1 collide. (That is assuming that ContentIdentifiers don't tend to sha1 collide.
If there was a collision it would fail to import the new file. But it seems If there was a collision it would fail to import the new file. But it seems
@ -24,21 +35,14 @@ reasonable, because git loses data on sha1 collisions anyway, and ContentIdentif
are no more likely to collide than the content of files, and probably less are no more likely to collide than the content of files, and probably less
likely overall..) likely overall..)
How fast can a git tree of say, 10000 files be generated? Is it faster than > I implemented this optimisation. Importing from a special remote that
querying sqlite 10000 times? > has 10000 files, that have all been imported before, and 1 new file
> sped up from 26.06 to 2.59 seconds. An import with no changes sped
> up from 24.3 to 1.99 seconds. Going up to 20000 files, an import with
> no changes sped up from 125.95 to 3.84 seconds.
> (All measured with warm cache.)
Once it knows which files are changed, it still needs to generate the > (Note that I have only implemented this optimisation for imports that
imported tree, which contains both changed and unchanged files. How to > do not include History. So importing from versioned S3 buckets will
handle unchanged files when generating that tree? Current method is > still be slow. It would be possible to do a similar optimisation for
to do a database lookup to convert the ContentIdentifier into a Key, and > History, but it seemed complicated so I punted.) --[[Joey]]
record that in the tree. But those database lookups are the slow thing that
needs to be avoided. Seems like it will need to either use adjustTree, or a
separate index file. (The index file would make importing a History hard.)
----
Another idea would to be use something faster than sqlite to record the cid
to key mappings. Looking up those mappings is the main thing that makes
import slow when only a few files have changed and a large number have not.
--[[Joey]]