use importChanges optimisation
Large speed up to importing trees from special remotes that contain a lot of files, by only processing changed files. Benchmarks: Importing from a special remote that has 10000 files, that have all been imported before, and 1 new file sped up from 26.06 to 2.59 seconds. An import with no change and 10000 unchanged files sped up from 24.3 to 1.99 seconds. Going up to 20000 files, an import with no changes sped up from 125.95 to 3.84 seconds. Sponsored-by: k0ld on Patreon
This commit is contained in:
parent
029b08f54b
commit
40017089f2
5 changed files with 174 additions and 86 deletions
204
Annex/Import.hs
204
Annex/Import.hs
|
@ -15,6 +15,7 @@ module Annex.Import (
|
||||||
recordImportTree,
|
recordImportTree,
|
||||||
canImportKeys,
|
canImportKeys,
|
||||||
ImportResult(..),
|
ImportResult(..),
|
||||||
|
Imported,
|
||||||
importChanges,
|
importChanges,
|
||||||
importKeys,
|
importKeys,
|
||||||
makeImportMatcher,
|
makeImportMatcher,
|
||||||
|
@ -41,6 +42,7 @@ import Annex.RemoteTrackingBranch
|
||||||
import Annex.HashObject
|
import Annex.HashObject
|
||||||
import Annex.Transfer
|
import Annex.Transfer
|
||||||
import Annex.CheckIgnore
|
import Annex.CheckIgnore
|
||||||
|
import Annex.CatFile
|
||||||
import Annex.VectorClock
|
import Annex.VectorClock
|
||||||
import Command
|
import Command
|
||||||
import Backend
|
import Backend
|
||||||
|
@ -104,9 +106,9 @@ buildImportCommit
|
||||||
:: Remote
|
:: Remote
|
||||||
-> ImportTreeConfig
|
-> ImportTreeConfig
|
||||||
-> ImportCommitConfig
|
-> ImportCommitConfig
|
||||||
-> ImportableContentsChunkable Annex (Either Sha Key)
|
-> Imported
|
||||||
-> Annex (Maybe Ref)
|
-> Annex (Maybe Ref)
|
||||||
buildImportCommit remote importtreeconfig importcommitconfig importable =
|
buildImportCommit remote importtreeconfig importcommitconfig imported =
|
||||||
case importCommitTracking importcommitconfig of
|
case importCommitTracking importcommitconfig of
|
||||||
Nothing -> go Nothing
|
Nothing -> go Nothing
|
||||||
Just trackingcommit -> inRepo (Git.Ref.tree trackingcommit) >>= \case
|
Just trackingcommit -> inRepo (Git.Ref.tree trackingcommit) >>= \case
|
||||||
|
@ -114,8 +116,8 @@ buildImportCommit remote importtreeconfig importcommitconfig importable =
|
||||||
Just _ -> go (Just trackingcommit)
|
Just _ -> go (Just trackingcommit)
|
||||||
where
|
where
|
||||||
go trackingcommit = do
|
go trackingcommit = do
|
||||||
(imported, updatestate) <- recordImportTree remote importtreeconfig importable
|
(importedtree, updatestate) <- recordImportTree remote importtreeconfig imported
|
||||||
buildImportCommit' remote importcommitconfig trackingcommit imported >>= \case
|
buildImportCommit' remote importcommitconfig trackingcommit importedtree >>= \case
|
||||||
Just finalcommit -> do
|
Just finalcommit -> do
|
||||||
updatestate
|
updatestate
|
||||||
return (Just finalcommit)
|
return (Just finalcommit)
|
||||||
|
@ -129,11 +131,11 @@ buildImportCommit remote importtreeconfig importcommitconfig importable =
|
||||||
recordImportTree
|
recordImportTree
|
||||||
:: Remote
|
:: Remote
|
||||||
-> ImportTreeConfig
|
-> ImportTreeConfig
|
||||||
-> ImportableContentsChunkable Annex (Either Sha Key)
|
-> Imported
|
||||||
-> Annex (History Sha, Annex ())
|
-> Annex (History Sha, Annex ())
|
||||||
recordImportTree remote importtreeconfig importable = do
|
recordImportTree remote importtreeconfig imported = do
|
||||||
imported@(History finaltree _) <- buildImportTrees basetree subdir importable
|
importedtree@(History finaltree _) <- buildImportTrees basetree subdir imported
|
||||||
return (imported, updatestate finaltree)
|
return (importedtree, updatestate finaltree)
|
||||||
where
|
where
|
||||||
basetree = case importtreeconfig of
|
basetree = case importtreeconfig of
|
||||||
ImportTree -> emptyTree
|
ImportTree -> emptyTree
|
||||||
|
@ -265,34 +267,69 @@ buildImportCommit' remote importcommitconfig mtrackingcommit imported@(History t
|
||||||
parents <- mapM (mknewcommits oldhc old) (S.toList hs)
|
parents <- mapM (mknewcommits oldhc old) (S.toList hs)
|
||||||
mkcommit parents importedtree
|
mkcommit parents importedtree
|
||||||
|
|
||||||
{- Builds a history of git trees reflecting the ImportableContents.
|
{- Builds a history of git trees for an import.
|
||||||
-
|
-
|
||||||
- When a subdir is provided, imported tree is grafted into the basetree at
|
- When a subdir is provided, the imported tree is grafted into
|
||||||
- that location, replacing any object that was there.
|
- the basetree at that location, replacing any object that was there.
|
||||||
-}
|
-}
|
||||||
buildImportTrees
|
buildImportTrees
|
||||||
:: Ref
|
:: Ref
|
||||||
-> Maybe TopFilePath
|
-> Maybe TopFilePath
|
||||||
-> ImportableContentsChunkable Annex (Either Sha Key)
|
-> Imported
|
||||||
-> Annex (History Sha)
|
-> Annex (History Sha)
|
||||||
buildImportTrees = buildImportTreesGeneric convertImportTree
|
buildImportTrees basetree msubdir (ImportedFull imported) =
|
||||||
|
buildImportTreesGeneric convertImportTree basetree msubdir imported
|
||||||
|
buildImportTrees basetree msubdir (ImportedDiff (LastImportedTree oldtree) imported) = do
|
||||||
|
importtree <- if null (importableContents imported)
|
||||||
|
then pure oldtree
|
||||||
|
else applydiff
|
||||||
|
repo <- Annex.gitRepo
|
||||||
|
t <- withMkTreeHandle repo $
|
||||||
|
graftImportTree basetree msubdir importtree
|
||||||
|
-- Diffing is not currently implemented when the history is not empty.
|
||||||
|
return (History t mempty)
|
||||||
|
where
|
||||||
|
applydiff = do
|
||||||
|
let (removed, new) = partition isremoved
|
||||||
|
(importableContents imported)
|
||||||
|
newtreeitems <- catMaybes <$> mapM mktreeitem new
|
||||||
|
let removedfiles = map (mkloc . fst) removed
|
||||||
|
inRepo $ adjustTree
|
||||||
|
(pure . Just)
|
||||||
|
-- ^ keep files that are not added/removed the same
|
||||||
|
newtreeitems
|
||||||
|
(\_oldti newti -> newti)
|
||||||
|
-- ^ prefer newly added version of file
|
||||||
|
removedfiles
|
||||||
|
oldtree
|
||||||
|
|
||||||
|
mktreeitem (loc, DiffChanged v) =
|
||||||
|
Just <$> mkImportTreeItem msubdir loc v
|
||||||
|
mktreeitem (_, DiffRemoved) =
|
||||||
|
pure Nothing
|
||||||
|
|
||||||
|
mkloc = asTopFilePath . fromImportLocation
|
||||||
|
|
||||||
|
isremoved (_, v) = v == DiffRemoved
|
||||||
|
|
||||||
convertImportTree :: Maybe TopFilePath -> [(ImportLocation, Either Sha Key)] -> Annex Tree
|
convertImportTree :: Maybe TopFilePath -> [(ImportLocation, Either Sha Key)] -> Annex Tree
|
||||||
convertImportTree msubdir ls = treeItemsToTree <$> mapM mktreeitem ls
|
convertImportTree msubdir ls =
|
||||||
|
treeItemsToTree <$> mapM (uncurry $ mkImportTreeItem msubdir) ls
|
||||||
|
|
||||||
|
mkImportTreeItem :: Maybe TopFilePath -> ImportLocation -> Either Sha Key -> Annex TreeItem
|
||||||
|
mkImportTreeItem msubdir loc v = case v of
|
||||||
|
Right k -> do
|
||||||
|
relf <- fromRepo $ fromTopFilePath topf
|
||||||
|
symlink <- calcRepo $ gitAnnexLink relf k
|
||||||
|
linksha <- hashSymlink symlink
|
||||||
|
return $ TreeItem treepath (fromTreeItemType TreeSymlink) linksha
|
||||||
|
Left sha ->
|
||||||
|
return $ TreeItem treepath (fromTreeItemType TreeFile) sha
|
||||||
where
|
where
|
||||||
mktreeitem (loc, v) = case v of
|
lf = fromImportLocation loc
|
||||||
Right k -> do
|
treepath = asTopFilePath lf
|
||||||
relf <- fromRepo $ fromTopFilePath topf
|
topf = asTopFilePath $
|
||||||
symlink <- calcRepo $ gitAnnexLink relf k
|
maybe lf (\sd -> getTopFilePath sd P.</> lf) msubdir
|
||||||
linksha <- hashSymlink symlink
|
|
||||||
return $ TreeItem treepath (fromTreeItemType TreeSymlink) linksha
|
|
||||||
Left sha ->
|
|
||||||
return $ TreeItem treepath (fromTreeItemType TreeFile) sha
|
|
||||||
where
|
|
||||||
lf = fromImportLocation loc
|
|
||||||
treepath = asTopFilePath lf
|
|
||||||
topf = asTopFilePath $
|
|
||||||
maybe lf (\sd -> getTopFilePath sd P.</> lf) msubdir
|
|
||||||
|
|
||||||
{- Builds a history of git trees using ContentIdentifiers.
|
{- Builds a history of git trees using ContentIdentifiers.
|
||||||
-
|
-
|
||||||
|
@ -426,11 +463,24 @@ canImportKeys remote importcontent =
|
||||||
where
|
where
|
||||||
ia = Remote.importActions remote
|
ia = Remote.importActions remote
|
||||||
|
|
||||||
|
-- Result of an import. ImportUnfinished indicates that some file failed to
|
||||||
|
-- be imported. Running again should resume where it left off.
|
||||||
|
data ImportResult t
|
||||||
|
= ImportFinished t
|
||||||
|
| ImportUnfinished
|
||||||
|
|
||||||
data Diffed t
|
data Diffed t
|
||||||
= DiffChanged t
|
= DiffChanged t
|
||||||
| DiffRemoved
|
| DiffRemoved
|
||||||
|
deriving (Eq)
|
||||||
|
|
||||||
{- Diffs between the current and previous ContentIdentifier trees, and
|
data Imported
|
||||||
|
= ImportedFull (ImportableContentsChunkable Annex (Either Sha Key))
|
||||||
|
| ImportedDiff LastImportedTree (ImportableContents (Diffed (Either Sha Key)))
|
||||||
|
|
||||||
|
newtype LastImportedTree = LastImportedTree Sha
|
||||||
|
|
||||||
|
{- Diffs between the previous and current ContentIdentifier trees, and
|
||||||
- runs importKeys on only the changed files.
|
- runs importKeys on only the changed files.
|
||||||
-
|
-
|
||||||
- This will download the same content as if importKeys were run on all
|
- This will download the same content as if importKeys were run on all
|
||||||
|
@ -447,9 +497,7 @@ importChanges
|
||||||
-> Bool
|
-> Bool
|
||||||
-> Bool
|
-> Bool
|
||||||
-> ImportableContentsChunkable Annex (ContentIdentifier, ByteSize)
|
-> ImportableContentsChunkable Annex (ContentIdentifier, ByteSize)
|
||||||
-> Annex (ImportResult (Either
|
-> Annex (ImportResult Imported)
|
||||||
(ImportableContentsChunkable Annex (Either Sha Key))
|
|
||||||
(ImportableContentsChunkable Annex (Diffed (Either Sha Key)))))
|
|
||||||
importChanges remote importtreeconfig importcontent thirdpartypopulated importablecontents = do
|
importChanges remote importtreeconfig importcontent thirdpartypopulated importablecontents = do
|
||||||
((History currcidtree currhistory), cidtreemap) <- buildContentIdentifierTree importablecontents
|
((History currcidtree currhistory), cidtreemap) <- buildContentIdentifierTree importablecontents
|
||||||
-- diffimport below does not handle history, so when there is
|
-- diffimport below does not handle history, so when there is
|
||||||
|
@ -459,56 +507,90 @@ importChanges remote importtreeconfig importcontent thirdpartypopulated importab
|
||||||
else do
|
else do
|
||||||
getContentIdentifierTree (Remote.uuid remote) >>= \case
|
getContentIdentifierTree (Remote.uuid remote) >>= \case
|
||||||
Nothing -> fullimport currcidtree
|
Nothing -> fullimport currcidtree
|
||||||
Just prevcidtree -> diffimport cidtreemap prevcidtree currcidtree
|
Just prevcidtree -> candiffimport prevcidtree >>= \case
|
||||||
|
Nothing -> fullimport currcidtree
|
||||||
|
Just lastimportedtree -> diffimport cidtreemap prevcidtree currcidtree lastimportedtree
|
||||||
where
|
where
|
||||||
remember = recordContentIdentifierTree (Remote.uuid remote)
|
remember = recordContentIdentifierTree (Remote.uuid remote)
|
||||||
|
|
||||||
|
-- In order to use a diff, the previous ContentIdentifier tree must
|
||||||
|
-- not have been garbage collected. Which can happen since there
|
||||||
|
-- are no git refs to it.
|
||||||
|
--
|
||||||
|
-- Also, a tree must have been imported before, and that tree must
|
||||||
|
-- also have not been garbage collected (which is less likely to
|
||||||
|
-- happen due to the remote tracking branch).
|
||||||
|
candiffimport prevcidtree =
|
||||||
|
catObjectMetaData prevcidtree >>= \case
|
||||||
|
Nothing -> return Nothing
|
||||||
|
Just _ -> getLastImportedTree remote >>= \case
|
||||||
|
Nothing -> return Nothing
|
||||||
|
Just lastimported@(LastImportedTree t) ->
|
||||||
|
ifM (isJust <$> catObjectMetaData t)
|
||||||
|
( return (Just lastimported)
|
||||||
|
, return Nothing
|
||||||
|
)
|
||||||
|
|
||||||
fullimport currcidtree =
|
fullimport currcidtree =
|
||||||
importKeys remote importtreeconfig importcontent thirdpartypopulated importablecontents >>= \case
|
importKeys remote importtreeconfig importcontent thirdpartypopulated importablecontents >>= \case
|
||||||
ImportUnfinished -> return ImportUnfinished
|
ImportUnfinished -> return ImportUnfinished
|
||||||
ImportFinished r -> do
|
ImportFinished r -> do
|
||||||
remember currcidtree
|
remember currcidtree
|
||||||
return $ ImportFinished $ Left r
|
return $ ImportFinished $ ImportedFull r
|
||||||
|
|
||||||
diffimport cidtreemap prevcidtree currcidtree = do
|
diffimport cidtreemap prevcidtree currcidtree lastimportedtree = do
|
||||||
(diff, cleanup) <- inRepo $ Git.DiffTree.diffTreeRecursive currcidtree prevcidtree
|
(diff, cleanup) <- inRepo $ Git.DiffTree.diffTreeRecursive
|
||||||
let (removed, changed) = partition (\ti -> Git.DiffTree.dstsha ti `elem` nullShas) diff
|
prevcidtree
|
||||||
let mkloc = mkImportLocation . getTopFilePath . Git.DiffTree.file
|
currcidtree
|
||||||
|
let (removed, changed) = partition isremoval diff
|
||||||
let mkicchanged ti = do
|
let mkicchanged ti = do
|
||||||
v <- M.lookup (Git.DiffTree.dstsha ti) cidtreemap
|
v <- M.lookup (Git.DiffTree.dstsha ti) cidtreemap
|
||||||
return (mkloc ti, v)
|
return (mkloc ti, v)
|
||||||
let ic = ImportableContentsComplete $ ImportableContents
|
let ic = ImportableContentsComplete $ ImportableContents
|
||||||
{ importableContents = mapMaybe mkicchanged changed
|
{ importableContents = mapMaybe mkicchanged changed
|
||||||
, importableHistory = []
|
, importableHistory = []
|
||||||
}
|
}
|
||||||
importKeys remote importtreeconfig importcontent thirdpartypopulated ic >>= \case
|
importKeys remote importtreeconfig importcontent thirdpartypopulated ic >>= \case
|
||||||
ImportUnfinished -> do
|
ImportUnfinished -> do
|
||||||
void $ liftIO cleanup
|
void $ liftIO cleanup
|
||||||
return ImportUnfinished
|
return ImportUnfinished
|
||||||
ImportFinished (ImportableContentsComplete ic') -> liftIO cleanup >>= \case
|
ImportFinished (ImportableContentsComplete ic') ->
|
||||||
False -> return ImportUnfinished
|
liftIO cleanup >>= \case
|
||||||
True -> do
|
False -> return ImportUnfinished
|
||||||
remember currcidtree
|
True -> do
|
||||||
let diffchanged = map
|
remember currcidtree
|
||||||
(\(loc, v) -> (loc, DiffChanged v))
|
return $ ImportFinished $
|
||||||
(importableContents ic')
|
ImportedDiff lastimportedtree
|
||||||
let diffremoved = map
|
(mkdiff ic' removed)
|
||||||
(\ti -> (mkloc ti, DiffRemoved))
|
|
||||||
removed
|
|
||||||
let ic'' = ImportableContentsComplete $ ImportableContents
|
|
||||||
{ importableContents = diffremoved ++ diffchanged
|
|
||||||
, importableHistory = []
|
|
||||||
}
|
|
||||||
return $ ImportFinished $ Right ic''
|
|
||||||
-- importKeys is not passed ImportableContentsChunked
|
-- importKeys is not passed ImportableContentsChunked
|
||||||
-- above, so it cannot return it
|
-- above, so it cannot return it
|
||||||
ImportFinished (ImportableContentsChunked {}) -> error "internal"
|
ImportFinished (ImportableContentsChunked {}) -> error "internal"
|
||||||
|
|
||||||
|
isremoval ti = Git.DiffTree.dstsha ti `elem` nullShas
|
||||||
|
|
||||||
|
mkloc = mkImportLocation . getTopFilePath . Git.DiffTree.file
|
||||||
|
|
||||||
-- Result of an import. ImportUnfinished indicates that some file failed to
|
mkdiff ic removed = ImportableContents
|
||||||
-- be imported. Running again should resume where it left off.
|
{ importableContents = diffremoved ++ diffchanged
|
||||||
data ImportResult t
|
, importableHistory = []
|
||||||
= ImportFinished t
|
}
|
||||||
| ImportUnfinished
|
where
|
||||||
|
diffchanged = map
|
||||||
|
(\(loc, v) -> (loc, DiffChanged v))
|
||||||
|
(importableContents ic)
|
||||||
|
diffremoved = map
|
||||||
|
(\ti -> (mkloc ti, DiffRemoved))
|
||||||
|
removed
|
||||||
|
|
||||||
|
{- Gets the tree that was last imported from the remote
|
||||||
|
- (or exported to it if an export happened after the last import).
|
||||||
|
-}
|
||||||
|
getLastImportedTree :: Remote -> Annex (Maybe LastImportedTree)
|
||||||
|
getLastImportedTree remote = do
|
||||||
|
db <- Export.openDb (Remote.uuid remote)
|
||||||
|
mtree <- liftIO $ Export.getExportTreeCurrent db
|
||||||
|
Export.closeDb db
|
||||||
|
return (LastImportedTree <$> mtree)
|
||||||
|
|
||||||
{- Downloads all new ContentIdentifiers, or when importcontent is False,
|
{- Downloads all new ContentIdentifiers, or when importcontent is False,
|
||||||
- generates Keys without downloading.
|
- generates Keys without downloading.
|
||||||
|
|
|
@ -75,6 +75,8 @@ git-annex (10.20230408) UNRELEASED; urgency=medium
|
||||||
youtube-dl. Using youtube-dl is now deprecated, and git-annex no longer
|
youtube-dl. Using youtube-dl is now deprecated, and git-annex no longer
|
||||||
tries to parse its output to display download progress
|
tries to parse its output to display download progress
|
||||||
* repair: Fix handling of git ref names on Windows.
|
* repair: Fix handling of git ref names on Windows.
|
||||||
|
* Large speed up to importing trees from special remotes that contain a lot
|
||||||
|
of files, by only processing changed files.
|
||||||
* Speed up importing trees from special remotes somewhat by avoiding
|
* Speed up importing trees from special remotes somewhat by avoiding
|
||||||
redundant writes to sqlite database.
|
redundant writes to sqlite database.
|
||||||
|
|
||||||
|
|
|
@ -335,7 +335,7 @@ seekRemote remote branch msubdir importcontent ci = do
|
||||||
void $ includeCommandAction (listContents remote importtreeconfig ci importabletvar)
|
void $ includeCommandAction (listContents remote importtreeconfig ci importabletvar)
|
||||||
liftIO (atomically (readTVar importabletvar)) >>= \case
|
liftIO (atomically (readTVar importabletvar)) >>= \case
|
||||||
Nothing -> return ()
|
Nothing -> return ()
|
||||||
Just importable -> importKeys remote importtreeconfig importcontent False importable >>= \case
|
Just importable -> importChanges remote importtreeconfig importcontent False importable >>= \case
|
||||||
ImportUnfinished -> warning $ UnquotedString $ concat
|
ImportUnfinished -> warning $ UnquotedString $ concat
|
||||||
[ "Failed to import some files from "
|
[ "Failed to import some files from "
|
||||||
, Remote.name remote
|
, Remote.name remote
|
||||||
|
@ -373,10 +373,10 @@ listContents' remote importtreeconfig ci a =
|
||||||
, err
|
, err
|
||||||
]
|
]
|
||||||
|
|
||||||
commitRemote :: Remote -> Branch -> RemoteTrackingBranch -> Maybe Sha -> ImportTreeConfig -> ImportCommitConfig -> ImportableContentsChunkable Annex (Either Sha Key) -> CommandStart
|
commitRemote :: Remote -> Branch -> RemoteTrackingBranch -> Maybe Sha -> ImportTreeConfig -> ImportCommitConfig -> Imported -> CommandStart
|
||||||
commitRemote remote branch tb trackingcommit importtreeconfig importcommitconfig importable =
|
commitRemote remote branch tb trackingcommit importtreeconfig importcommitconfig imported =
|
||||||
starting "update" ai si $ do
|
starting "update" ai si $ do
|
||||||
importcommit <- buildImportCommit remote importtreeconfig importcommitconfig importable
|
importcommit <- buildImportCommit remote importtreeconfig importcommitconfig imported
|
||||||
next $ updateremotetrackingbranch importcommit
|
next $ updateremotetrackingbranch importcommit
|
||||||
where
|
where
|
||||||
ai = ActionItemOther (Just $ UnquotedString $ fromRef $ fromRemoteTrackingBranch tb)
|
ai = ActionItemOther (Just $ UnquotedString $ fromRef $ fromRemoteTrackingBranch tb)
|
||||||
|
|
|
@ -586,9 +586,9 @@ pullThirdPartyPopulated o remote
|
||||||
| otherwise = void $ includeCommandAction $ starting "list" ai si $
|
| otherwise = void $ includeCommandAction $ starting "list" ai si $
|
||||||
Command.Import.listContents' remote ImportTree (CheckGitIgnore False) go
|
Command.Import.listContents' remote ImportTree (CheckGitIgnore False) go
|
||||||
where
|
where
|
||||||
go (Just importable) = importKeys remote ImportTree False True importable >>= \case
|
go (Just importable) = importChanges remote ImportTree False True importable >>= \case
|
||||||
ImportFinished importablekeys -> do
|
ImportFinished imported -> do
|
||||||
(_imported, updatestate) <- recordImportTree remote ImportTree importablekeys
|
(_t, updatestate) <- recordImportTree remote ImportTree imported
|
||||||
next $ do
|
next $ do
|
||||||
updatestate
|
updatestate
|
||||||
return True
|
return True
|
||||||
|
|
|
@ -10,13 +10,24 @@ own scalability limits with many files.)
|
||||||
|
|
||||||
Still, it would be good to find some ways to speed it up.
|
Still, it would be good to find some ways to speed it up.
|
||||||
|
|
||||||
|
In particular, speeding up repeated imports from the same special remote,
|
||||||
|
when only a few files have changed, would make it much more useful. It's ok
|
||||||
|
to pay a somewhat expensive price to import a lot of new files, if updates
|
||||||
|
are quick after that.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
A major thing that makes it slow, when a remote contains
|
||||||
|
many files, is converting from ContentIdentifiers to Keys.
|
||||||
|
It does a cidsdb lookup for every file, before it knows if the file has
|
||||||
|
changed or not, which gets slow with a lot of files.
|
||||||
|
|
||||||
What if it generated a git tree, where each file in the tree is
|
What if it generated a git tree, where each file in the tree is
|
||||||
a sha1 hash of the ContentIdentifier. The tree can just be recorded locally
|
a sha1 hash of the ContentIdentifier. The tree can just be recorded locally
|
||||||
somewhere. It's ok if it gets garbage collected; it's only an optimisation.
|
somewhere. It's ok if it gets garbage collected; it's only an optimisation.
|
||||||
On the next sync, diff from the old to the new tree. It only needs to
|
On the next sync, diff from the old to the new tree. It only needs to
|
||||||
import the changed files!
|
import the changed files, and can avoid the cidsdb lookup for the
|
||||||
|
unchanged files!
|
||||||
|
|
||||||
(That is assuming that ContentIdentifiers don't tend to sha1 collide.
|
(That is assuming that ContentIdentifiers don't tend to sha1 collide.
|
||||||
If there was a collision it would fail to import the new file. But it seems
|
If there was a collision it would fail to import the new file. But it seems
|
||||||
|
@ -24,21 +35,14 @@ reasonable, because git loses data on sha1 collisions anyway, and ContentIdentif
|
||||||
are no more likely to collide than the content of files, and probably less
|
are no more likely to collide than the content of files, and probably less
|
||||||
likely overall..)
|
likely overall..)
|
||||||
|
|
||||||
How fast can a git tree of say, 10000 files be generated? Is it faster than
|
> I implemented this optimisation. Importing from a special remote that
|
||||||
querying sqlite 10000 times?
|
> has 10000 files, that have all been imported before, and 1 new file
|
||||||
|
> sped up from 26.06 to 2.59 seconds. An import with no changes sped
|
||||||
|
> up from 24.3 to 1.99 seconds. Going up to 20000 files, an import with
|
||||||
|
> no changes sped up from 125.95 to 3.84 seconds.
|
||||||
|
> (All measured with warm cache.)
|
||||||
|
|
||||||
Once it knows which files are changed, it still needs to generate the
|
> (Note that I have only implemented this optimisation for imports that
|
||||||
imported tree, which contains both changed and unchanged files. How to
|
> do not include History. So importing from versioned S3 buckets will
|
||||||
handle unchanged files when generating that tree? Current method is
|
> still be slow. It would be possible to do a similar optimisation for
|
||||||
to do a database lookup to convert the ContentIdentifier into a Key, and
|
> History, but it seemed complicated so I punted.) --[[Joey]]
|
||||||
record that in the tree. But those database lookups are the slow thing that
|
|
||||||
needs to be avoided. Seems like it will need to either use adjustTree, or a
|
|
||||||
separate index file. (The index file would make importing a History hard.)
|
|
||||||
|
|
||||||
----
|
|
||||||
|
|
||||||
Another idea would to be use something faster than sqlite to record the cid
|
|
||||||
to key mappings. Looking up those mappings is the main thing that makes
|
|
||||||
import slow when only a few files have changed and a large number have not.
|
|
||||||
|
|
||||||
--[[Joey]]
|
|
||||||
|
|
Loading…
Add table
Reference in a new issue