benchmarking of filter-process vs smudge/clean

No firm conclusions yet, but it's doing better than I would have
expected.

Sponsored-by: Graham Spencer on Patreon
Joey Hess 2021-11-05 13:37:53 -04:00
parent 099e8fe061
commit 054c803f8d
GPG key ID: DB12DB0FF05F8F38

@@ -120,6 +120,9 @@ The best fix would be to improve git's smudge/clean interface:
## benchmarking
The goal is both to characterise how slow this interface makes git-annex,
and to investigate when enabling filter-process is an improvement and when it is not.
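For reference, the two modes being compared can be toggled per repository. A minimal sketch, assuming git is installed; the git-annex commands are only configured, not run, so git-annex itself need not be present (git-annex init normally sets the smudge/clean filters for you):

```shell
# Sketch: how the two filter modes are configured, in a throwaway repository.
repo=$(mktemp -d)
git -C "$repo" init -q

# The classic smudge/clean interface: git runs one process per file.
git -C "$repo" config filter.annex.smudge 'git-annex smudge -- %f'
git -C "$repo" config filter.annex.clean 'git-annex clean -- %f'

# The long-running filter-process protocol: one process per git command.
git -C "$repo" config filter.annex.process 'git-annex filter-process'

# Revert to plain smudge/clean for comparison:
git -C "$repo" config --unset filter.annex.process
```

When `filter.<driver>.process` is set, git uses it in preference to the `smudge` and `clean` settings for that driver, so unsetting it is enough to switch back.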
* git add of 1000 small files (adding to git repository not annex)
- no git-annex: 0.2s
- git-annex with smudge --clean: 63.3s
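The baseline above can be reproduced with a short script; a sketch, assuming git is installed (the file names and the 1000-file count are just the scenario from the benchmark):

```shell
# Sketch of the "git add of 1000 small files" baseline (no git-annex).
repo=$(mktemp -d)
git -C "$repo" init -q
i=1
while [ $i -le 1000 ]; do
    echo "file $i" > "$repo/file$i"
    i=$((i + 1))
done
# Portable timing via date, since the time keyword is not POSIX sh.
start=$(date +%s)
git -C "$repo" add .
end=$(date +%s)
staged=$(git -C "$repo" ls-files | wc -l | tr -d ' ')
echo "added $staged files in $((end - start))s"
```

The git-annex variants differ only in repository setup: with git-annex installed, running `git-annex init` (and optionally setting `filter.annex.process`) before the add would give the other two rows.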
@@ -146,10 +149,63 @@ The best fix would be to improve git's smudge/clean interface:
the piping to add more overhead than it seems to have.
* git checkout of branch with 1000 small annexed files
- no git-annex (checking out annex pointer files): 0.1s
- git-annex with smudge: 83.4s
- git-annex with filter-process: 16.0s ()
With filter-process, the actual checkout takes under a second;
the rest is spent in the post-checkout hook, which populates the
annexed files and restages them in git. The restaging does not
use filter-process currently. The number in parens is with
git-annex modified so the restaging does use filter-process.
- git-annex with smudge: 145s
- git-annex with filter-process enabled: 13.1s
Win for filter-process, but small annexed files are somewhat
unusual. See next benchmark.
* git checkout of branch with 1 gb annexed file
- git-annex with smudge: 5.6s
- git-annex with filter-process enabled: 11.2s
Here filter-process slows it down. The reason is that the
post-checkout hook runs, populating the annexed file and restaging
it in git. The restaging uses filter-process, and git feeds the
annexed file contents through the pipe, even though git-annex
does not need to see that data. So it makes sense that
filter-process is about twice as slow as smudge: with smudge
it only has to write the file, and does not also read it.
With more annexed data being checked out, it should continue to
scale like this, with filter-process being 2x as expensive,
or perhaps more (if disk cache stops helping).
Disabling filter-process during the restaging would improve
this case, but unfortunately it does not seem easy to do
that (see [[!commit 837025b14f523f9180f82d0cced1e53a8a9b94de]]).
* git-annex get of 1000 small annexed files
- git-annex with smudge: 100.1s
- git-annex with filter-process enabled: 39.3s
The difference is due to restaging in git needing to pass
the annexed files through the filter.
Win for filter-process, but small annexed files are somewhat
unusual. See next benchmark.
* git-annex get of a 1 gb annexed file
- git-annex with smudge: 21.5s
- git-annex with filter-process enabled: 22.8s
Transfer time was around 12s; the rest is copying the file
to the work tree, plus restaging overhead. So filter-process
is slower because git sends the file content to it over a pipe
unnecessarily. This is less of a loss for filter-process than I
expected, though again disk cache probably helped.
* git-annex get of two 1 gb annexed files
- git-annex with smudge: 42.3s
- git-annex with filter-process enabled: 46.7s
This shows that filter-process gets progressively worse
as the amount of annexed data that git-annex gets goes up.
It is not a fast increase, but it will add up. Also, disk cache
will stop helping at some point.
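The pipe overhead behind several of the numbers above comes from git's long-running filter protocol, which frames everything in pkt-line format: 4 hex digits giving the total packet length (header included), then the payload, with `0000` as a flush packet; file content is split across many such packets (each payload is limited to around 64 KiB), which is why streaming a 1 gb file through the pipe is real work. A sketch of the framing (illustrative only, not git-annex's code):

```shell
# pkt-line framing as used by git's filter-process protocol.
# Each packet: 4 hex digits of total length (4 header bytes + payload),
# then the payload. '0000' is a flush packet that ends a stream.
pkt_line() {
    printf '%04x%s' $(( ${#1} + 4 )) "$1"
}

pkt_line 'command=smudge'   # emits: 0012command=smudge
echo
printf '0000\n'             # flush packet
```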
Benchmark summary:
* filter-process makes `git add` slightly slower for large
files that are added to the annex, but not as much as expected (and it can
be improved), so overall it's a win for `git add`.
* filter-process makes `git checkout`, `merge`, etc of unlocked annexed files
at least twice as slow as the size of annexed data goes up, but it does avoid
very slow checkouts when there are a lot of non-annexed or smaller unlocked
annexed files. That benefit may be worth the overhead, though it would
be good to check the overhead with larger annexed data checkouts to see
how it scales.
* filter-process makes `git-annex get` slower as the size of annexed data
goes up. The time spent actually getting the data will typically
dominate (network being slower than disk), though, so this may be an
acceptable tradeoff for many users.