cache negative lookups of global numcopies and mincopies

Speeds up e.g. git-annex sync --content by up to 50%. When it does not need
to transfer or drop anything, it now no-ops a lot more quickly.

I didn't see anything else in the sync --content noop loop that could really
be sped up; it still has to cat git objects to keys, stat object files, etc.

Sponsored-by: unqueued on Patreon
This commit is contained in:
Joey Hess 2023-06-06 14:15:47 -04:00
parent 4437e187e6
commit 3c15e0f7a0
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
5 changed files with 38 additions and 6 deletions


@@ -183,8 +183,8 @@ data AnnexState = AnnexState
 	, hashobjecthandle :: Maybe (ResourcePool HashObjectHandle)
 	, checkattrhandle :: Maybe (ResourcePool CheckAttrHandle)
 	, checkignorehandle :: Maybe (ResourcePool CheckIgnoreHandle)
-	, globalnumcopies :: Maybe NumCopies
-	, globalmincopies :: Maybe MinCopies
+	, globalnumcopies :: Maybe (Maybe NumCopies)
+	, globalmincopies :: Maybe (Maybe MinCopies)
 	, limit :: ExpandableMatcher Annex
 	, timelimit :: Maybe (Duration, POSIXTime)
 	, sizelimit :: Maybe (TVar Integer)


@@ -79,6 +79,8 @@ git-annex (10.20230408) UNRELEASED; urgency=medium
   * Large speed up to importing trees from special remotes that contain a lot
     of files, by only processing changed files.
   * Some other speedups to importing trees from special remotes.
+  * Cache negative lookups of global numcopies and mincopies.
+    Speeds up eg git-annex sync --content by up to 50%.
  -- Joey Hess <id@joeyh.name>  Sat, 08 Apr 2023 13:57:18 -0400


@@ -45,22 +45,22 @@ setGlobalMinCopies new = do
 {- Value configured in the numcopies log. Cached for speed. -}
 getGlobalNumCopies :: Annex (Maybe NumCopies)
-getGlobalNumCopies = maybe globalNumCopiesLoad (return . Just)
+getGlobalNumCopies = maybe globalNumCopiesLoad return
 	=<< Annex.getState Annex.globalnumcopies
 
 {- Value configured in the mincopies log. Cached for speed. -}
 getGlobalMinCopies :: Annex (Maybe MinCopies)
-getGlobalMinCopies = maybe globalMinCopiesLoad (return . Just)
+getGlobalMinCopies = maybe globalMinCopiesLoad return
 	=<< Annex.getState Annex.globalmincopies
 
 globalNumCopiesLoad :: Annex (Maybe NumCopies)
 globalNumCopiesLoad = do
 	v <- getLog numcopiesLog
-	Annex.changeState $ \s -> s { Annex.globalnumcopies = v }
+	Annex.changeState $ \s -> s { Annex.globalnumcopies = Just v }
 	return v
 
 globalMinCopiesLoad :: Annex (Maybe MinCopies)
 globalMinCopiesLoad = do
 	v <- getLog mincopiesLog
-	Annex.changeState $ \s -> s { Annex.globalmincopies = v }
+	Annex.changeState $ \s -> s { Annex.globalmincopies = Just v }
 	return v
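The nesting of Maybe in the cache field is what makes the negative caching work: the outer Maybe records whether the log has been consulted at all, and the inner Maybe records the (possibly absent) configured value, so "looked it up and found nothing" is cached just like a positive result. A minimal standalone sketch of the pattern, using a hypothetical `cachedLookup` over an IORef rather than git-annex's AnnexState:

```haskell
import Data.IORef

-- Cache states for Maybe (Maybe a):
--   Nothing       = not looked up yet
--   Just Nothing  = looked up; no value configured (negative result cached)
--   Just (Just v) = looked up; value v configured
cachedLookup :: IORef (Maybe (Maybe Int)) -> IO (Maybe Int) -> IO (Maybe Int)
cachedLookup cache expensiveLookup = do
    c <- readIORef cache
    case c of
        Just v  -> return v          -- cache hit, even for a negative result
        Nothing -> do
            v <- expensiveLookup     -- slow path, runs at most once
            writeIORef cache (Just v)
            return v

main :: IO ()
main = do
    cache <- newIORef Nothing
    counter <- newIORef (0 :: Int)
    -- The "expensive" lookup finds nothing configured, and counts its calls.
    let slow = modifyIORef counter (+ 1) >> return Nothing
    r1 <- cachedLookup cache slow
    r2 <- cachedLookup cache slow    -- served from cache; slow is not re-run
    n <- readIORef counter
    print (r1, r2, n)
```

With only the outer Maybe (as before this commit), a Nothing result is indistinguishable from "not yet looked up", so every call would hit the slow path again.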


@@ -0,0 +1,12 @@
[[!comment format=mdwn
username="joey"
subject="""comment 14"""
date="2023-06-06T17:11:35Z"
content="""
There's only one import in the sync, and your output shows it completed
(with an error).
The only other phase of sync that could be run after that and take a lot of
time is content syncing. You would have to have annex.synccontent set
somewhere for sync to do that. Do you?
"""]]


@@ -0,0 +1,18 @@
[[!comment format=mdwn
username="joey"
subject="""comment 15"""
date="2023-06-06T17:31:49Z"
content="""
It would make a lot of sense for --content syncing to be what remains slow.
That has to scan over all the files and when it decides that it does not
need to copy the content anywhere, that's a tight loop with no output.
In my repo with 10000 files, set up by the latest test case,
`git-annex sync` takes 13 seconds, and with --content it takes 61 seconds.
I optimised a numcopies/mincopies lookup away, and that got it
down to 28 seconds.
The cidsdb does not get accessed by the --content scan
in my testing, although there may be other situations where it does.
"""]]