avoid import writing to cidsdb initially
Speed up importing trees from special remotes somewhat by avoiding redundant writes to sqlite database. Before, import would write to both the git-annex branch and also to the sqlite database. But then the next time it was run, needsUpdateFromLog would see the branch had changed, so run updateFromLog, which would make the same writes to the sqlite database a second time. Now import writes only to the git-annex branch. The next time it's run, needsUpdateFromLog sees that the branch has changed and so calls updateFromLog, which updates the sqlite database. Why defer the write to the sqlite database like this? It seems that it could write to the database as it goes, and at the end call recordAnnexBranchTree to indicate that the information in the git-annex branch has all been written to the cidsdb. That would avoid the second import doing extra work. But, there could be other processes running at the same time, and one of them may update the git-annex branch, eg merging a remote git-annex branch into it. Any cids logs on that merged git-annex branch would not be reflected in the cidsdb yet. If the import then called recordAnnexBranchTree, the cidsdb would never get updated with that merged information. I don't think there's a good way to prevent, or to detect that situation. So, it can't call recordAnnexBranchTree at the end. So it might as well wait until the next run and do updateFromLog then. It could instead do updateFromLog at the end, but it's going to check needsUpdateFromLog at the beginning anyway. Note that the database writes were queued, so there is already a cidmap that is used to remember changes that the current process has made. So, omitting database writes can't change the behavior of the current process. Also note that thirdpartypopulatedimport uses recordcidkeyindb, which reflects what it already did. That code path does not use the cidmap, but does not need to query it either. It might be possible to make that code path also only update the git-annex branch and not the db, but I haven't checked. Sponsored-by: Noam Kremen on Patreon
This commit is contained in:
parent
c1e415887a
commit
f6aa097a39
3 changed files with 33 additions and 47 deletions
|
@ -508,7 +508,7 @@ importKeys remote importtreeconfig importcontent thirdpartypopulated importablec
|
|||
let importaction = starting ("import " ++ Remote.name remote) ai si $ do
|
||||
when oldversion $
|
||||
showNote "old version"
|
||||
tryNonAsync (importordownload cidmap db i largematcher) >>= \case
|
||||
tryNonAsync (importordownload cidmap i largematcher) >>= \case
|
||||
Left e -> next $ do
|
||||
warning (UnquotedString (show e))
|
||||
liftIO $ atomically $
|
||||
|
@ -530,7 +530,7 @@ importKeys remote importtreeconfig importcontent thirdpartypopulated importablec
|
|||
Just importkey ->
|
||||
tryNonAsync (importkey loc cid sz nullMeterUpdate) >>= \case
|
||||
Right (Just k) -> do
|
||||
recordcidkey' db cid k
|
||||
recordcidkeyindb db cid k
|
||||
logChange k (Remote.uuid remote) InfoPresent
|
||||
return $ Just (loc, Right k)
|
||||
Right Nothing -> return Nothing
|
||||
|
@ -538,7 +538,7 @@ importKeys remote importtreeconfig importcontent thirdpartypopulated importablec
|
|||
warning (UnquotedString (show e))
|
||||
return Nothing
|
||||
|
||||
importordownload cidmap db (loc, (cid, sz)) largematcher= do
|
||||
importordownload cidmap (loc, (cid, sz)) largematcher= do
|
||||
f <- locworktreefile loc
|
||||
matcher <- largematcher f
|
||||
-- When importing a key is supported, always use it rather
|
||||
|
@ -551,9 +551,9 @@ importKeys remote importtreeconfig importcontent thirdpartypopulated importablec
|
|||
then dodownload
|
||||
else doimport
|
||||
else doimport
|
||||
act cidmap db (loc, (cid, sz)) f matcher
|
||||
act cidmap (loc, (cid, sz)) f matcher
|
||||
|
||||
doimport cidmap db (loc, (cid, sz)) f matcher =
|
||||
doimport cidmap (loc, (cid, sz)) f matcher =
|
||||
case Remote.importKey ia of
|
||||
Nothing -> error "internal" -- checked earlier
|
||||
Just importkey -> do
|
||||
|
@ -570,10 +570,10 @@ importKeys remote importtreeconfig importcontent thirdpartypopulated importablec
|
|||
let bwlimit = remoteAnnexBwLimit (Remote.gitconfig remote)
|
||||
islargefile <- checkMatcher' matcher mi mempty
|
||||
metered Nothing sz bwlimit $ const $ if islargefile
|
||||
then doimportlarge importkey cidmap db loc cid sz f
|
||||
else doimportsmall cidmap db loc cid sz
|
||||
then doimportlarge importkey cidmap loc cid sz f
|
||||
else doimportsmall cidmap loc cid sz
|
||||
|
||||
doimportlarge importkey cidmap db loc cid sz f p =
|
||||
doimportlarge importkey cidmap loc cid sz f p =
|
||||
tryNonAsync importer >>= \case
|
||||
Right (Just (k, True)) -> return $ Just (loc, Right k)
|
||||
Right _ -> return Nothing
|
||||
|
@ -591,7 +591,7 @@ importKeys remote importtreeconfig importcontent thirdpartypopulated importablec
|
|||
Nothing -> return Nothing
|
||||
Just k -> checkSecureHashes k >>= \case
|
||||
Nothing -> do
|
||||
recordcidkey cidmap db cid k
|
||||
recordcidkey cidmap cid k
|
||||
logChange k (Remote.uuid remote) InfoPresent
|
||||
if importcontent
|
||||
then getcontent k
|
||||
|
@ -618,7 +618,7 @@ importKeys remote importtreeconfig importcontent thirdpartypopulated importablec
|
|||
-- The file is small, so is added to git, so while importing
|
||||
-- without content does not retrieve annexed files, it does
|
||||
-- need to retrieve this file.
|
||||
doimportsmall cidmap db loc cid sz p = do
|
||||
doimportsmall cidmap loc cid sz p = do
|
||||
let downloader tmpfile = do
|
||||
(k, _) <- Remote.retrieveExportWithContentIdentifier
|
||||
ia loc [cid] (fromRawFilePath tmpfile)
|
||||
|
@ -626,7 +626,7 @@ importKeys remote importtreeconfig importcontent thirdpartypopulated importablec
|
|||
p
|
||||
case keyGitSha k of
|
||||
Just sha -> do
|
||||
recordcidkey cidmap db cid k
|
||||
recordcidkey cidmap cid k
|
||||
return sha
|
||||
Nothing -> error "internal"
|
||||
checkDiskSpaceToGet tmpkey Nothing $
|
||||
|
@ -640,7 +640,7 @@ importKeys remote importtreeconfig importcontent thirdpartypopulated importablec
|
|||
tmpkey = importKey cid sz
|
||||
mkkey tmpfile = gitShaKey <$> hashFile tmpfile
|
||||
|
||||
dodownload cidmap db (loc, (cid, sz)) f matcher = do
|
||||
dodownload cidmap (loc, (cid, sz)) f matcher = do
|
||||
let af = AssociatedFile (Just f)
|
||||
let downloader tmpfile p = do
|
||||
(k, _) <- Remote.retrieveExportWithContentIdentifier
|
||||
|
@ -651,12 +651,12 @@ importKeys remote importtreeconfig importcontent thirdpartypopulated importablec
|
|||
Nothing -> do
|
||||
ok <- moveAnnex k af tmpfile
|
||||
when ok $ do
|
||||
recordcidkey cidmap db cid k
|
||||
recordcidkey cidmap cid k
|
||||
logStatus k InfoPresent
|
||||
logChange k (Remote.uuid remote) InfoPresent
|
||||
return (Right k, ok)
|
||||
Just sha -> do
|
||||
recordcidkey cidmap db cid k
|
||||
recordcidkey cidmap cid k
|
||||
return (Left sha, True)
|
||||
let rundownload tmpfile p = tryNonAsync (downloader tmpfile p) >>= \case
|
||||
Right (v, True) -> return $ Just (loc, v)
|
||||
|
@ -706,12 +706,18 @@ importKeys remote importtreeconfig importcontent thirdpartypopulated importablec
|
|||
maybeToList . M.lookup cid <$> readTVar cidmap
|
||||
l -> return l
|
||||
|
||||
recordcidkey cidmap db cid k = do
|
||||
recordcidkey cidmap cid k = do
|
||||
liftIO $ atomically $ modifyTVar' cidmap $
|
||||
M.insert cid k
|
||||
recordcidkey' db cid k
|
||||
recordcidkey' db cid k = do
|
||||
-- Only record in log now; the database will be updated
|
||||
-- later from the log, and the cidmap will be used for now.
|
||||
recordcidkeyinlog cid k
|
||||
|
||||
recordcidkeyindb db cid k = do
|
||||
liftIO $ CIDDb.recordContentIdentifier db rs cid k
|
||||
recordcidkeyinlog cid k
|
||||
|
||||
recordcidkeyinlog cid k =
|
||||
CIDLog.recordContentIdentifier rs cid k
|
||||
|
||||
rs = Remote.remoteStateHandle remote
|
||||
|
|
|
@ -75,6 +75,8 @@ git-annex (10.20230408) UNRELEASED; urgency=medium
|
|||
youtube-dl. Using youtube-dl is now deprecated, and git-annex no longer
|
||||
tries to parse its output to display download progress
|
||||
* repair: Fix handling of git ref names on Windows.
|
||||
* Speed up importing trees from special remotes somewhat by avoiding
|
||||
redundant writes to sqlite database.
|
||||
|
||||
-- Joey Hess <id@joeyh.name> Sat, 08 Apr 2023 13:57:18 -0400
|
||||
|
||||
|
|
|
@ -3,36 +3,14 @@
|
|||
subject="""comment 2"""
|
||||
date="2023-05-30T18:49:34Z"
|
||||
content="""
|
||||
I think I see why a second sync re-writes the cidsdb with information from
|
||||
the git-annex branch.
|
||||
> The other hit is recordContentIdentifier, which happens for
|
||||
> each recorded cid, due to updateFromLog. That seems unnecessary, because
|
||||
> the previous sync already recorded all the cids.
|
||||
|
||||
The first sync does write to the cidsdb at the same time it writes to the
|
||||
git-annex branch. So, it seems that it could at the end call
|
||||
recordAnnexBranchTree to indicate that the information in the git-annex
|
||||
branch has all been written to the cidsdb. That would avoid the second sync
|
||||
doing extra work.
|
||||
I was able to eliminate that extra work. Now the first sync does not write
|
||||
to the cidsdb but only to the git-annex branch, and the second sync does
|
||||
the necessary work of updating the cidsdb from the git-annex branch.
|
||||
|
||||
But, there could be other processes running at the same time, and one of
|
||||
them may update the git-annex branch, eg merging a remote git-annex branch
|
||||
into it. Any cids logs on that merged git-annex branch would not be
|
||||
reflected in the cidsdb yet. If the sync then called
|
||||
recordAnnexBranchTree, the cidsdb would never get updated with that merged
|
||||
information.
|
||||
|
||||
I don't think there's a good way to prevent, or to detect that situation.
|
||||
So, it can't call recordAnnexBranchTree at the end, and has to do extra
|
||||
work in updateFromLog at the beginning.
|
||||
|
||||
What it could do is, only record a cid to the git-annex branch, not to the
|
||||
cidsdb. Then the updateFromLog would not be doing extra work, but necessary
|
||||
work. But, it also needs to read cids from the db during import, and if it
|
||||
doesn't record a cid there, it won't know when a later file has the same
|
||||
cid. So it will re-import it. Which for other special remotes than
|
||||
directory, means an expensive second download of the content.
|
||||
|
||||
Anyway, the extra work of re-writing the cidsdb is only done on the sync
|
||||
immediately after the one that did import some new files. And it only
|
||||
re-writes cids for the new files, not for unchanged files.
|
||||
|
||||
I'm not sure that this extra work is what the bug reporter was complaining about though.
|
||||
I'm not sure that extra work is what the bug reporter was complaining
|
||||
about though.
|
||||
"""]]
|
||||
|
|
Loading…
Reference in a new issue