dynamically disable filter-process in restagePointerFile when it would be slower

Based on my earlier benchmark, I have a rough cost model for how
expensive it is for git-annex smudge to be run on a file, vs
how expensive it is for a gigabyte of a file's content to be read and
piped through to filter-process.

So, using that cost model, it can decide if using filter-process will
be more or less expensive than running the smudge filter on the files to
be restaged.

It turned out to be *really* annoying to temporarily disable
filter-process. I did find a way, but urk, this is horrible. Notice
that, if it's interrupted with it disabled, it will remain disabled
until the next time restagePointerFile runs. Which could be some time
later. If the user runs `git add` or `git checkout` on a lot of small
files before that, they will see slower than expected performance.

(This commit also deletes where I wrote down the benchmark results
earlier.)

Sponsored-by: Noam Kremen on Patreon
This commit is contained in:
Joey Hess 2021-11-08 15:55:27 -04:00
parent 054c803f8d
commit a0758bdd10
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
6 changed files with 71 additions and 108 deletions

View file

@ -12,7 +12,7 @@
- Licensed under the GNU AGPL version 3 or higher.
-}
{-# LANGUAGE CPP, BangPatterns #-}
{-# LANGUAGE CPP, BangPatterns, OverloadedStrings #-}
module Annex.Link where
@ -35,6 +35,7 @@ import Utility.FileMode
import Utility.InodeCache
import Utility.Tmp.Dir
import Utility.CopyFile
import Utility.Tuple
import qualified Database.Keys.Handle
import qualified Utility.RawFilePath as R
@ -89,9 +90,9 @@ getAnnexLinkTarget' file coresymlinks = if coresymlinks
else
-- If there are any NUL or newline
-- characters, or whitespace, we
-- certianly don't have a symlink to a
-- certainly don't have a symlink to a
-- git-annex key.
if any (`S8.elem` s) "\0\n\r \t"
if any (`S8.elem` s) ("\0\n\r \t" :: [Char])
then mempty
else s
@ -189,7 +190,7 @@ restagePointerFile (Restage True) f orig = withTSDelta $ \tsd ->
-- fails on "../../repo/path/file" when cwd is not in the repo
-- being acted on. Avoid these problems with an absolute path.
absf <- liftIO $ absPath f
Annex.Queue.addInternalAction runner [(absf, isunmodified tsd)]
Annex.Queue.addInternalAction runner [(absf, isunmodified tsd, inodeCacheFileSize orig)]
where
isunmodified tsd = genInodeCache f tsd >>= return . \case
Nothing -> False
@ -226,9 +227,9 @@ restagePointerFile (Restage True) f orig = withTSDelta $ \tsd ->
[ Param "-c"
, Param $ "core.safecrlf=" ++ boolConfig False
] }
runsGitAnnexChildProcessViaGit' r'' $ \r''' ->
configfilterprocess l $ runsGitAnnexChildProcessViaGit' r'' $ \r''' ->
liftIO $ Git.UpdateIndex.refreshIndex r''' $ \feed ->
forM_ l $ \(f', checkunmodified) ->
forM_ l $ \(f', checkunmodified, _) ->
whenM checkunmodified $
feed f'
let replaceindex = catchBoolIO $ do
@ -239,6 +240,49 @@ restagePointerFile (Restage True) f orig = withTSDelta $ \tsd ->
<&&> liftIO replaceindex
unless ok showwarning
bracket lockindex unlockindex go
{- filter.annex.process configured to use git-annex filter-process
- is sometimes faster and sometimes slower than using
- git-annex smudge. The latter is run once per file, while
- the former has the content of files piped to it.
-}
filterprocessfaster l =
let numfiles = genericLength l
sizefiles = sum (map thd3 l)
-- estimates based on benchmarking
estimate_enabled = sizefiles `div` 191739611
estimate_disabled = numfiles `div` 7
in estimate_enabled <= estimate_disabled
{- This disables filter.annex.process if it's set when it would
- probably not be faster to use it. Unfortunately, simply
- passing -c filter.annex.process= also prevents git from
- running the smudge filter, so .git/config has to be modified
- to disable it. The modification is reversed at the end. In
- case this process is terminated early, the next time this
- runs it will take care of reversing the modification.
-}
configfilterprocess l = bracket setup cleanup . const
where
setup
| filterprocessfaster l = return Nothing
| otherwise = fromRepo (Git.Config.getMaybe ck) >>= \case
Nothing -> return Nothing
Just v -> do
void $ inRepo (Git.Config.change ckd (fromConfigValue v))
void $ inRepo (Git.Config.unset ck)
return (Just v)
cleanup (Just v) = do
void $ inRepo $ Git.Config.change ck (fromConfigValue v)
void $ inRepo (Git.Config.unset ckd)
cleanup Nothing = fromRepo (Git.Config.getMaybe ckd) >>= \case
Nothing -> return ()
Just v -> do
whenM (isNothing <$> fromRepo (Git.Config.getMaybe ck)) $
void $ inRepo (Git.Config.change ck (fromConfigValue v))
void $ inRepo (Git.Config.unset ckd)
ck = ConfigKey "filter.annex.process"
ckd = ConfigKey "filter.annex.process-temp-disabled"
unableToRestage :: Maybe FilePath -> String
unableToRestage mf = unwords

View file

@ -31,7 +31,7 @@ addCommand commonparams command params files = do
store =<< flushWhenFull =<<
(Git.Queue.addCommand commonparams command params files q =<< gitRepo)
addInternalAction :: Git.Queue.InternalActionRunner Annex -> [(RawFilePath, IO Bool)] -> Annex ()
addInternalAction :: Git.Queue.InternalActionRunner Annex -> [(RawFilePath, IO Bool, FileSize)] -> Annex ()
addInternalAction runner files = do
q <- get
store =<< flushWhenFull =<<

View file

@ -241,6 +241,14 @@ fromFile r f = fromPipe r "git"
, Param "--list"
] ConfigList
{- Changes a git config setting in .git/config. -}
change :: ConfigKey -> S.ByteString -> Repo -> IO Bool
change (ConfigKey k) v = Git.Command.runBool
[ Param "config"
, Param (decodeBS k)
, Param (decodeBS v)
]
{- Changes a git config setting in the specified config file.
- (Creates the file if it does not already exist.) -}
changeFile :: FilePath -> ConfigKey -> S.ByteString -> IO Bool

View file

@ -49,11 +49,11 @@ data Action m
- to as the queue grows. -}
| InternalAction
{ getRunner :: InternalActionRunner m
, getInternalFiles :: [(RawFilePath, IO Bool)]
, getInternalFiles :: [(RawFilePath, IO Bool, FileSize)]
}
{- The String must be unique for each internal action. -}
data InternalActionRunner m = InternalActionRunner String (Repo -> [(RawFilePath, IO Bool)] -> m ())
data InternalActionRunner m = InternalActionRunner String (Repo -> [(RawFilePath, IO Bool, FileSize)] -> m ())
instance Eq (InternalActionRunner m) where
InternalActionRunner s1 _ == InternalActionRunner s2 _ = s1 == s2
@ -116,7 +116,7 @@ addCommand commonparams subcommand params files q repo =
different _ = True
{- Adds an internal action to the queue. -}
addInternalAction :: MonadIO m => InternalActionRunner m -> [(RawFilePath, IO Bool)] -> Queue m -> Repo -> m (Queue m)
addInternalAction :: MonadIO m => InternalActionRunner m -> [(RawFilePath, IO Bool, FileSize)] -> Queue m -> Repo -> m (Queue m)
addInternalAction runner files q repo =
updateQueue action different (length files) q repo
where

View file

@ -85,12 +85,17 @@ And here's the consequences of git-annex's workarounds:
* After a git-annex get/drop or a git checkout or pull that affects a lot
of files, the clean filter gets run once per file, which is again, slower
than ideal. Enabling `git-annex filter-process` can speed up git checkout
or pull, but not git-annex get/drop.
than ideal. Enabling `git-annex filter-process` can speed this up
in some cases.
* When `git-annex filter-process` is enabled, it cannot use the trick
described above that `git-annex smudge --clean` uses to avoid git
piping the whole content of large files throuh it.
piping the whole content of large files through it. This mainly slows
down `git add` when it is being used with an annex.largefiles
confguration to add a large file to the annex. (Making filter-process
incrementally hash the content git passes to it will mostly avoid
this performance problem though it may always be a little bit slower
than `git-annex smudge --clean` due to the data piping.)
* In a rare situation, git-annex would like to get git to run the clean
filter, but it cannot because git has the index locked. So, git-annex has
@ -115,97 +120,3 @@ The best fix would be to improve git's smudge/clean interface:
* Allow clean filter to read work tree files itself, to avoid overhead of
sending huge files through a pipe.
----
## benchmarking
Goal is to both characterise how slow this interface makes git-annex,
and to investigate when enabling filter-process is an improvement, and not.
* git add of 1000 small files (adding to git repository not annex)
- no git-annex: 0.2s
- git-annex with smudge --clean: 63.3s
- git-annex with filter-process enabled: 2.3s
This is the obvious win case for filter-process. However, people
rarely add large numbers of small files to a git repository at the
same time.
* git add of 1000 small files (adding to annex)
- git-annex with smudge --clean: 120.9s
- git-annex with filter-process enabled: 28.2s
- (git-annex add of 1000 small files, for comparison): 17.2s
This is a decent win for filter-process, and would also be somewhat
of a win when adding larger files to the annex with git add, though
less so because hashing overhead would dominate that.
* git add of 1 gb file (adding to annex)
- git-annex with smudge --clean: 14.5s
- git-annex with filter-process enabled: 15.4s
This was a surprising result! With filter-process, git feeds
the file to git-annex via a pipe, and git-annex also reads it from
disk. Probably disk caching helped a lot to avoid this taking
longer. (`free` says the disk cache has 1.7gb available)
That double read could be avoided with some work to make
git-annex hash what it receives from the pipe. I also expected
the piping to add more overhead than it seems to have.
* git checkout of branch with 1000 small annexed files
- no git-annex (checking out annex pointer files): 0.1s
- git-annex with smudge: 145s
- git-annex with filter-process enabled: 13.1s
Win for filter-process, but small annexed files are somewhat
unusual. See next benchmark.
* git checkout of branch with 1 gb annexed file
- git-annex with smudge: 5.6s
- git-annex with filter-process enabled: 11.2s
Here filter-process slows it down, and the reason it does
is the post-checkout hook runs, which populates the annexed file
and restages it in git. The restaging uses filter-process, and git
feeds the annexed file contents through the pipe, though git-annex
does not need to see that data. So it makes sense that
filter-process is about twice as slow as smudge, since with smudge
it only has to write the file and does not also read it.
With more annexed data being checked out, it should continue to
scale like this, with filter-process being 2x as expensive,
or perhaps more (if disk cache stops helping).
Disabling filter-process during the restaging would improve
this case, but unfortunately it does not seem easy to do
that (see [[!commit 837025b14f523f9180f82d0cced1e53a8a9b94de]]).
* git-annex get of 1000 small annexed files
- git-annex with smudge: 100.1s
- git-annex with filter-process enabled: 39.3s
The difference is due to restaging in git needing to pass
the annexed files through the filter.
Win for filter-process, but small annexed files are somewhat
unusual. See next benchmark.
* git-annex get of a 1 gb annexed file
- git-annex with smudge: 21.5s
- git-annex with filter-process enabled: 22.8s
Transfer time was around 12s, the rest is copying the file
to the work tree and restaging overhead. So filter-process
is slower because git sends the file content to it over a pipe
unncessarily. Less of a loss for filter-process that I expected
though, but again disk cache probably helped.
* git-annex get of two 1 gb annexed files
- git-annex with smudge: 42.3s
- git-annex with filter-process enabled: 46.7s
This shows that filter-process will get progressively worse
as the amount of annexed data that git-annex gets goes up.
It is not a fast increase, but it will add up. Also disk cache
will stop helping at some point.
Benchmark summary:
* filter-process makes `git add` slightly slower for large
files that are added to the annex, but not as much as expected (and it can
be improved), so overall it's a win for `git add`.
* filter-process makes `git checkout`, `merge`, etc of unlocked annexed files
at least twice as slow as the size of annexed data goes up, but it does avoid
very slow checkouts when there are a lot of non-annexed or smaller unlocked
annexed files. That benefit may be worth the overhead, though it would
be good to check the overhead with larger annexed data checkouts to see
how it scales.
* filter-process makes `git-annex get` slower as the size of annexed data
goes up. Although the time spent actually getting the data will typically
dominate (network being slower than disk), so this may be an acceptable
tradeoff for many users.

View file

@ -16,4 +16,4 @@ could change and if it does, these things could be included.
seem worth it.
It does not currently incrementally hash, so implementing that first
would imporve the tradeoffs.
would improve the tradeoffs.