benchmarking of filter-process vs smudge/clean

No firm conclusions yet, but it's doing better than I would have
expected.

Sponsored-by: Graham Spencer on Patreon
Joey Hess 2021-11-05 13:37:53 -04:00
parent 099e8fe061
commit 054c803f8d
GPG key ID: DB12DB0FF05F8F38

@@ -120,6 +120,9 @@ The best fix would be to improve git's smudge/clean interface:
## benchmarking
The goal is both to characterise how slow this interface makes git-annex,
and to investigate when enabling filter-process is an improvement and when it is not.
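For reference, the two modes being compared can be toggled per repository. A minimal sketch, assuming git is installed; the git-annex commands are only configured, not run, so git-annex itself need not be present (git-annex init normally sets the smudge/clean filters for you):

```shell
# Sketch: how the two filter modes are configured, in a throwaway repository.
repo=$(mktemp -d)
git -C "$repo" init -q

# The classic smudge/clean interface: git runs one process per file.
git -C "$repo" config filter.annex.smudge 'git-annex smudge -- %f'
git -C "$repo" config filter.annex.clean 'git-annex clean -- %f'

# The long-running filter-process protocol: one process per git command.
git -C "$repo" config filter.annex.process 'git-annex filter-process'

# Revert to plain smudge/clean for comparison:
git -C "$repo" config --unset filter.annex.process
```

When `filter.<driver>.process` is set, git uses it in preference to the `smudge` and `clean` settings for that driver, so unsetting it is enough to switch back.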
* git add of 1000 small files (adding to git repository not annex)
- no git-annex: 0.2s
- git-annex with smudge --clean: 63.3s
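The baseline above can be reproduced with a short script; a sketch, assuming git is installed (the file names and the 1000-file count are just the scenario from the benchmark):

```shell
# Sketch of the "git add of 1000 small files" baseline (no git-annex).
repo=$(mktemp -d)
git -C "$repo" init -q
i=1
while [ $i -le 1000 ]; do
    echo "file $i" > "$repo/file$i"
    i=$((i + 1))
done
# Portable timing via date, since the time keyword is not POSIX sh.
start=$(date +%s)
git -C "$repo" add .
end=$(date +%s)
staged=$(git -C "$repo" ls-files | wc -l | tr -d ' ')
echo "added $staged files in $((end - start))s"
```

The git-annex variants differ only in repository setup: with git-annex installed, running `git-annex init` (and optionally setting `filter.annex.process`) before the add would give the other two rows.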
@@ -146,10 +149,63 @@ The best fix would be to improve git's smudge/clean interface:
the piping to add more overhead than it seems to have.
* git checkout of branch with 1000 small annexed files
- no git-annex (checking out annex pointer files): 0.1s
- git-annex with smudge: 83.4s
- git-annex with filter-process: 16.0s ()
With filter-process, the actual checkout takes under a second;
the rest is spent in the post-checkout hook, which populates the
annexed files and restages them in git. The restaging does not
use filter-process currently. The number in parens is with
git-annex modified so the restaging does use filter-process.
- git-annex with smudge: 145s
- git-annex with filter-process enabled: 13.1s
Win for filter-process, but small annexed files are somewhat
unusual. See next benchmark.
* git checkout of branch with 1 gb annexed file
- git-annex with smudge: 5.6s
- git-annex with filter-process enabled: 11.2s
Here filter-process slows it down. The reason is that the
post-checkout hook runs, populating the annexed file and restaging
it in git. The restaging uses filter-process, and git feeds the
annexed file contents through the pipe, even though git-annex
does not need to see that data. So it makes sense that
filter-process is about twice as slow as smudge: with smudge
it only has to write the file, and does not also read it.
With more annexed data being checked out, it should continue to
scale like this, with filter-process being 2x as expensive,
or perhaps more (if disk cache stops helping).
Disabling filter-process during the restaging would improve
this case, but unfortunately it does not seem easy to do
that (see [[!commit 837025b14f523f9180f82d0cced1e53a8a9b94de]]).
* git-annex get of 1000 small annexed files
- git-annex with smudge: 100.1s
- git-annex with filter-process enabled: 39.3s
The difference is due to restaging in git needing to pass
the annexed files through the filter.
Win for filter-process, but small annexed files are somewhat
unusual. See next benchmark.
* git-annex get of a 1 gb annexed file
- git-annex with smudge: 21.5s
- git-annex with filter-process enabled: 22.8s
Transfer time was around 12s; the rest is copying the file
to the work tree, plus restaging overhead. So filter-process
is slower because git sends the file content to it over a pipe
unnecessarily. This is less of a loss for filter-process than I
expected, though again disk cache probably helped.
* git-annex get of two 1 gb annexed files
- git-annex with smudge: 42.3s
- git-annex with filter-process enabled: 46.7s
This shows that filter-process gets progressively worse
as the amount of annexed data that git-annex gets goes up.
It is not a fast increase, but it will add up. Also, disk cache
will stop helping at some point.
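The pipe overhead behind several of the numbers above comes from git's long-running filter protocol, which frames everything in pkt-line format: 4 hex digits giving the total packet length (header included), then the payload, with `0000` as a flush packet; file content is split across many such packets (each payload is limited to around 64 KiB), which is why streaming a 1 gb file through the pipe is real work. A sketch of the framing (illustrative only, not git-annex's code):

```shell
# pkt-line framing as used by git's filter-process protocol.
# Each packet: 4 hex digits of total length (4 header bytes + payload),
# then the payload. '0000' is a flush packet that ends a stream.
pkt_line() {
    printf '%04x%s' $(( ${#1} + 4 )) "$1"
}

pkt_line 'command=smudge'   # emits: 0012command=smudge
echo
printf '0000\n'             # flush packet
```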
Benchmark summary:
* filter-process makes `git add` slightly slower for large
files that are added to the annex, but not as much as expected (and it can
be improved), so overall it's a win for `git add`.
* filter-process makes `git checkout`, `merge`, etc of unlocked annexed files
at least twice as slow as the size of annexed data goes up, but it does avoid
very slow checkouts when there are a lot of non-annexed or smaller unlocked
annexed files. That benefit may be worth the overhead, though it would
be good to check the overhead with larger annexed data checkouts to see
how it scales.
* filter-process makes `git-annex get` slower as the size of annexed data
goes up. The time spent actually getting the data will typically
dominate (network being slower than disk), though, so this may be an
acceptable tradeoff for many users.