dynamically disable filter-process in restagePointerFile when it would be slower

Based on my earlier benchmark, I have a rough cost model for how expensive it is for git-annex smudge to be run on a file, vs how expensive it is for a gigabyte of a file's content to be read and piped through to filter-process. So, using that cost model, it can decide if using filter-process will be more or less expensive than running the smudge filter on the files to be restaged. It turned out to be *really* annoying to temporarily disable filter-process. I did find a way, but urk, this is horrible. Notice that, if it's interrupted with it disabled, it will remain disabled until the next time restagePointerFile runs. Which could be some time later. If the user runs `git add` or `git checkout` on a lot of small files before that, they will see slower than expected performance. (This commit also deletes where I wrote down the benchmark results earlier.) Sponsored-by: Noam Kremen on Patreon
2021-11-08 15:55:27 -04:00 · 2021-11-08 15:55:27 -04:00 · a0758bdd10
commit a0758bdd10
parent 054c803f8d
6 changed files with 71 additions and 108 deletions
--- a/Annex/Link.hs
+++ b/Annex/Link.hs
@ -12,7 +12,7 @@
 - Licensed under the GNU AGPL version 3 or higher.
 -}

-{-# LANGUAGE CPP, BangPatterns #-}
+{-# LANGUAGE CPP, BangPatterns, OverloadedStrings #-}

 module Annex.Link where

@ -35,6 +35,7 @@ import Utility.FileMode
 import Utility.InodeCache
 import Utility.Tmp.Dir
 import Utility.CopyFile
+import Utility.Tuple
 import qualified Database.Keys.Handle
 import qualified Utility.RawFilePath as R

@ -89,9 +90,9 @@ getAnnexLinkTarget' file coresymlinks = if coresymlinks
 			else 
 				-- If there are any NUL or newline
 				-- characters, or whitespace, we
-				-- certianly don't have a symlink to a
+				-- certainly don't have a symlink to a
 				-- git-annex key.
-				if any (`S8.elem` s) "\0\n\r \t"
+				if any (`S8.elem` s) ("\0\n\r \t" :: [Char])
 					then mempty
 					else s

@ -189,7 +190,7 @@ restagePointerFile (Restage True) f orig = withTSDelta $ \tsd ->
 		-- fails on "../../repo/path/file" when cwd is not in the repo 
 		-- being acted on. Avoid these problems with an absolute path.
 		absf <- liftIO $ absPath f
-		Annex.Queue.addInternalAction runner [(absf, isunmodified tsd)]
+		Annex.Queue.addInternalAction runner [(absf, isunmodified tsd, inodeCacheFileSize orig)]
  where
 	isunmodified tsd = genInodeCache f tsd >>= return . \case
 		Nothing -> False
@ -226,9 +227,9 @@ restagePointerFile (Restage True) f orig = withTSDelta $ \tsd ->
 					[ Param "-c"
 					, Param $ "core.safecrlf=" ++ boolConfig False
 					] }
-				runsGitAnnexChildProcessViaGit' r'' $ \r''' ->
+				configfilterprocess l $ runsGitAnnexChildProcessViaGit' r'' $ \r''' ->
 					liftIO $ Git.UpdateIndex.refreshIndex r''' $ \feed ->
-						forM_ l $ \(f', checkunmodified) ->
+						forM_ l $ \(f', checkunmodified, _) ->
 							whenM checkunmodified $
 								feed f'
 			let replaceindex = catchBoolIO $ do
@ -239,6 +240,49 @@ restagePointerFile (Restage True) f orig = withTSDelta $ \tsd ->
 				<&&> liftIO replaceindex
 			unless ok showwarning
 		bracket lockindex unlockindex go
+	
+	{- filter.annex.process configured to use git-annex filter-process
+	 - is sometimes faster and sometimes slower than using
+	 - git-annex smudge. The latter is run once per file, while
+	 - the former has the content of files piped to it.
+	 -}
+	filterprocessfaster l = 
+		let numfiles = genericLength l
+		    sizefiles = sum (map thd3 l)
+		    -- estimates based on benchmarking
+		    estimate_enabled = sizefiles `div` 191739611
+		    estimate_disabled = numfiles `div` 7
+		in estimate_enabled <= estimate_disabled
+	 
+	 {- This disables filter.annex.process if it's set when it would
+	  - probably not be faster to use it. Unfortunately, simply
+	  - passing -c filter.annex.process= also prevents git from
+	  - running the smudge filter, so .git/config has to be modified
+	  - to disable it. The modification is reversed at the end. In
+	  - case this process is terminated early, the next time this
+	  - runs it will take care of reversing the modification.
+	  -}
+	configfilterprocess l = bracket setup cleanup . const
+	  where
+		setup
+			| filterprocessfaster l = return Nothing
+			| otherwise = fromRepo (Git.Config.getMaybe ck) >>= \case
+				Nothing -> return Nothing
+				Just v -> do
+					void $ inRepo (Git.Config.change ckd (fromConfigValue v))
+					void $ inRepo (Git.Config.unset ck)
+					return (Just v)
+		cleanup (Just v) = do
+			void $ inRepo $ Git.Config.change ck (fromConfigValue v)
+			void $ inRepo (Git.Config.unset ckd)
+		cleanup Nothing = fromRepo (Git.Config.getMaybe ckd) >>= \case
+			Nothing -> return ()
+			Just v -> do
+				whenM (isNothing <$> fromRepo (Git.Config.getMaybe ck)) $
+					void $ inRepo (Git.Config.change ck (fromConfigValue v))
+				void $ inRepo (Git.Config.unset ckd)
+		ck = ConfigKey "filter.annex.process"
+		ckd = ConfigKey "filter.annex.process-temp-disabled"

 unableToRestage :: Maybe FilePath -> String
 unableToRestage mf = unwords
--- a/Annex/Queue.hs
+++ b/Annex/Queue.hs
@ -31,7 +31,7 @@ addCommand commonparams command params files = do
 	store =<< flushWhenFull =<<
 		(Git.Queue.addCommand commonparams command params files q =<< gitRepo)

-addInternalAction :: Git.Queue.InternalActionRunner Annex -> [(RawFilePath, IO Bool)] -> Annex ()
+addInternalAction :: Git.Queue.InternalActionRunner Annex -> [(RawFilePath, IO Bool, FileSize)] -> Annex ()
 addInternalAction runner files = do
 	q <- get
 	store =<< flushWhenFull =<<
--- a/Git/Config.hs
+++ b/Git/Config.hs
@ -241,6 +241,14 @@ fromFile r f = fromPipe r "git"
 	, Param "--list"
 	] ConfigList

+{- Changes a git config setting in .git/config. -}
+change :: ConfigKey -> S.ByteString -> Repo -> IO Bool
+change (ConfigKey k) v = Git.Command.runBool
+	[ Param "config"
+	, Param (decodeBS k)
+	, Param (decodeBS v)
+	]
+
 {- Changes a git config setting in the specified config file.
 - (Creates the file if it does not already exist.) -}
 changeFile :: FilePath -> ConfigKey -> S.ByteString -> IO Bool
--- a/Git/Queue.hs
+++ b/Git/Queue.hs
@ -49,11 +49,11 @@ data Action m
 	 - to as the queue grows. -}
 	| InternalAction
 		{ getRunner :: InternalActionRunner m
-		, getInternalFiles :: [(RawFilePath, IO Bool)]
+		, getInternalFiles :: [(RawFilePath, IO Bool, FileSize)]
 		}

 {- The String must be unique for each internal action. -}
-data InternalActionRunner m = InternalActionRunner String (Repo -> [(RawFilePath, IO Bool)] -> m ())
+data InternalActionRunner m = InternalActionRunner String (Repo -> [(RawFilePath, IO Bool, FileSize)] -> m ())

 instance Eq (InternalActionRunner m) where
 	InternalActionRunner s1 _ == InternalActionRunner s2 _ = s1 == s2
@ -116,7 +116,7 @@ addCommand commonparams subcommand params files q repo =
 	different _ = True

 {- Adds an internal action to the queue. -}
-addInternalAction :: MonadIO m => InternalActionRunner m -> [(RawFilePath, IO Bool)] -> Queue m -> Repo -> m (Queue m)
+addInternalAction :: MonadIO m => InternalActionRunner m -> [(RawFilePath, IO Bool, FileSize)] -> Queue m -> Repo -> m (Queue m)
 addInternalAction runner files q repo =
 	updateQueue action different (length files) q repo
  where
--- a/doc/todo/git_smudge_clean_interface_suboptiomal.mdwn
+++ b/doc/todo/git_smudge_clean_interface_suboptiomal.mdwn
@ -85,12 +85,17 @@ And here's the consequences of git-annex's workarounds:

 * After a git-annex get/drop or a git checkout or pull that affects a lot
  of files, the clean filter gets run once per file, which is again, slower
-  than ideal. Enabling `git-annex filter-process` can speed up git checkout
-  or pull, but not git-annex get/drop.
+  than ideal. Enabling `git-annex filter-process` can speed this up
+  in some cases.

 * When `git-annex filter-process` is enabled, it cannot use the trick
  described above that `git-annex smudge --clean` uses to avoid git
-  piping the whole content of large files throuh it.
+  piping the whole content of large files through it. This mainly slows
+  down `git add` when it is being used with an annex.largefiles
+  confguration to add a large file to the annex. (Making filter-process
+  incrementally hash the content git passes to it will mostly avoid
+  this performance problem though it may always be a little bit slower
+  than `git-annex smudge --clean` due to the data piping.)

 * In a rare situation, git-annex would like to get git to run the clean
  filter, but it cannot because git has the index locked. So, git-annex has
@ -115,97 +120,3 @@ The best fix would be to improve git's smudge/clean interface:

 * Allow clean filter to read work tree files itself, to avoid overhead of
  sending huge files through a pipe.
-
----
-
-## benchmarking
-
-Goal is to both characterise how slow this interface makes git-annex,
-and to investigate when enabling filter-process is an improvement, and not.
-
-* git add of 1000 small files (adding to git repository not annex)
-	- no git-annex: 0.2s
-	- git-annex with smudge --clean: 63.3s
-	- git-annex with filter-process enabled: 2.3s
-	This is the obvious win case for filter-process. However, people
-	rarely add large numbers of small files to a git repository at the
-	same time.
-* git add of 1000 small files (adding to annex)
-	- git-annex with smudge --clean: 120.9s
-	- git-annex with filter-process enabled: 28.2s
-	- (git-annex add of 1000 small files, for comparison): 17.2s
-	This is a decent win for filter-process, and would also be somewhat
-	of a win when adding larger files to the annex with git add, though
-	less so because hashing overhead would dominate that.
-* git add of 1 gb file (adding to annex)
-	- git-annex with smudge --clean: 14.5s
-	- git-annex with filter-process enabled: 15.4s
-	This was a surprising result! With filter-process, git feeds
-	the file to git-annex via a pipe, and git-annex also reads it from
-	disk. Probably disk caching helped a lot to avoid this taking
-	longer. (`free` says the disk cache has 1.7gb available)
-	That double read could be avoided with some work to make
-	git-annex hash what it receives from the pipe. I also expected
-	the piping to add more overhead than it seems to have.
-* git checkout of branch with 1000 small annexed files
-	- no git-annex (checking out annex pointer files): 0.1s
-	- git-annex with smudge: 145s
-	- git-annex with filter-process enabled: 13.1s
-	Win for filter-process, but small annexed files are somewhat
-	unusual. See next benchmark.
-* git checkout of branch with 1 gb annexed file
-	- git-annex with smudge: 5.6s
-	- git-annex with filter-process enabled: 11.2s
-	Here filter-process slows it down, and the reason it does
-	is the post-checkout hook runs, which populates the annexed file
-	and restages it in git. The restaging uses filter-process, and git
-	feeds the annexed file contents through the pipe, though git-annex
-	does not need to see that data. So it makes sense that
-	filter-process is about twice as slow as smudge, since with smudge
-	it only has to write the file and does not also read it.
-	With more annexed data being checked out, it should continue to
-	scale like this, with filter-process being 2x as expensive,
-	or perhaps more (if disk cache stops helping).
-	Disabling filter-process during the restaging would improve
-	this case, but unfortunately it does not seem easy to do
-	that (see [[!commit 837025b14f523f9180f82d0cced1e53a8a9b94de]]).
-* git-annex get of 1000 small annexed files
-	- git-annex with smudge: 100.1s
-	- git-annex with filter-process enabled: 39.3s
-	The difference is due to restaging in git needing to pass
-	the annexed files through the filter.
-	Win for filter-process, but small annexed files are somewhat
-	unusual. See next benchmark.
-* git-annex get of a 1 gb annexed file
-	- git-annex with smudge: 21.5s
-	- git-annex with filter-process enabled: 22.8s
-	Transfer time was around 12s, the rest is copying the file
-	to the work tree and restaging overhead. So filter-process
-	is slower because git sends the file content to it over a pipe
-	unncessarily. Less of a loss for filter-process that I expected
-	though, but again disk cache probably helped.
-* git-annex get of two 1 gb annexed files
-	- git-annex with smudge: 42.3s
-	- git-annex with filter-process enabled: 46.7s
-	This shows that filter-process will get progressively worse
-	as the amount of annexed data that git-annex gets goes up.
-	It is not a fast increase, but it will add up. Also disk cache
-	will stop helping at some point.
-
-Benchmark summary:
-
-* filter-process makes `git add` slightly slower for large
-  files that are added to the annex, but not as much as expected (and it can
-  be improved), so overall it's a win for `git add`.
-
-* filter-process makes `git checkout`, `merge`, etc of unlocked annexed files
-  at least twice as slow as the size of annexed data goes up, but it does avoid
-  very slow checkouts when there are a lot of non-annexed or smaller unlocked
-  annexed files. That benefit may be worth the overhead, though it would
-  be good to check the overhead with larger annexed data checkouts to see
-  how it scales.
-
-* filter-process makes `git-annex get` slower as the size of annexed data
-  goes up. Although the time spent actually getting the data will typically
-  dominate (network being slower than disk), so this may be an acceptable
-  tradeoff for many users.
--- a/doc/todo/v9_changes.mdwn
+++ b/doc/todo/v9_changes.mdwn
@ -16,4 +16,4 @@ could change and if it does, these things could be included.
  seem worth it.

  It does not currently incrementally hash, so implementing that first
-  would imporve the tradeoffs.
+  would improve the tradeoffs.