benchmarking of filter-process vs smudge/clean
No firm conclusions yet, but it's doing better than I would have expected.

Sponsored-by: Graham Spencer on Patreon
parent 099e8fe061
commit 054c803f8d
1 changed files with 63 additions and 7 deletions
@@ -120,6 +120,9 @@ The best fix would be to improve git's smudge/clean interface:

 ## benchmarking

+Goal is to both characterise how slow this interface makes git-annex,
+and to investigate when enabling filter-process is an improvement, and not.
+
 * git add of 1000 small files (adding to git repository not annex)
 - no git-annex: 0.2s
 - git-annex with smudge --clean: 63.3s
@@ -146,10 +149,63 @@ The best fix would be to improve git's smudge/clean interface:
 the piping to add more overhead than it seems to have.
 * git checkout of branch with 1000 small annexed files
 - no git-annex (checking out annex pointer files): 0.1s
-- git-annex with smudge: 83.4s
-- git-annex with filter-process: 16.0s ()
-With filter-process, the actual checkout takes under a second,
-then the post-checkout hook which populates the annexed files
-and restages them in git. The restaging does not
-use filter-process currently. The number in parens is with
-git-annex modified so the restaging does use filter-process.
+- git-annex with smudge: 145s
+- git-annex with filter-process enabled: 13.1s
+Win for filter-process, but small annexed files are somewhat
+unusual. See next benchmark.
+* git checkout of branch with 1 gb annexed file
+- git-annex with smudge: 5.6s
+- git-annex with filter-process enabled: 11.2s
+Here filter-process slows it down, and the reason it does
+is the post-checkout hook runs, which populates the annexed file
+and restages it in git. The restaging uses filter-process, and git
+feeds the annexed file contents through the pipe, though git-annex
+does not need to see that data. So it makes sense that
+filter-process is about twice as slow as smudge, since with smudge
+it only has to write the file and does not also read it.
+With more annexed data being checked out, it should continue to
+scale like this, with filter-process being 2x as expensive,
+or perhaps more (if disk cache stops helping).
+Disabling filter-process during the restaging would improve
+this case, but unfortunately it does not seem easy to do
+that (see [[!commit 837025b14f523f9180f82d0cced1e53a8a9b94de]]).
+* git-annex get of 1000 small annexed files
+- git-annex with smudge: 100.1s
+- git-annex with filter-process enabled: 39.3s
+The difference is due to restaging in git needing to pass
+the annexed files through the filter.
+Win for filter-process, but small annexed files are somewhat
+unusual. See next benchmark.
+* git-annex get of a 1 gb annexed file
+- git-annex with smudge: 21.5s
+- git-annex with filter-process enabled: 22.8s
+Transfer time was around 12s, the rest is copying the file
+to the work tree and restaging overhead. So filter-process
+is slower because git sends the file content to it over a pipe
+unnecessarily. Less of a loss for filter-process than I expected
+though, but again disk cache probably helped.
+* git-annex get of two 1 gb annexed files
+- git-annex with smudge: 42.3s
+- git-annex with filter-process enabled: 46.7s
+This shows that filter-process will get progressively worse
+as the amount of annexed data that git-annex gets goes up.
+It is not a fast increase, but it will add up. Also disk cache
+will stop helping at some point.
+
+Benchmark summary:
+
+* filter-process makes `git add` slightly slower for large
+files that are added to the annex, but not as much as expected (and it can
+be improved), so overall it's a win for `git add`.
+
+* filter-process makes `git checkout`, `merge`, etc of unlocked annexed files
+at least twice as slow as the size of annexed data goes up, but it does avoid
+very slow checkouts when there are a lot of non-annexed or smaller unlocked
+annexed files. That benefit may be worth the overhead, though it would
+be good to check the overhead with larger annexed data checkouts to see
+how it scales.
+
+* filter-process makes `git-annex get` slower as the size of annexed data
+goes up. Although the time spent actually getting the data will typically
+dominate (network being slower than disk), so this may be an acceptable
+tradeoff for many users.
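The ratios behind the summary can be rechecked from the reported timings. What follows is only a rough sketch in Python, not part of the commit: the names `TIMINGS`, `slowdown` and `report` are made up for illustration, and the only inputs are the new-side checkout and get numbers listed in the diff.

```python
# Timings in seconds, copied from the new side of the benchmark list.
# Each entry maps a benchmark to (smudge, filter-process enabled).
# The dictionary and function names are hypothetical, for illustration only.
TIMINGS = {
    "checkout, 1000 small annexed files": (145.0, 13.1),
    "checkout, 1 gb annexed file": (5.6, 11.2),
    "get, 1000 small annexed files": (100.1, 39.3),
    "get, 1 gb annexed file": (21.5, 22.8),
    "get, two 1 gb annexed files": (42.3, 46.7),
}


def slowdown(smudge, filter_process):
    """Return filter-process time divided by smudge time (>1 means slower)."""
    return filter_process / smudge


def report(timings):
    """Return (name, ratio, delta in seconds, verdict) for each benchmark."""
    rows = []
    for name, (smudge, fp) in timings.items():
        ratio = slowdown(smudge, fp)
        verdict = "filter-process slower" if ratio > 1 else "filter-process faster"
        rows.append((name, round(ratio, 2), round(fp - smudge, 1), verdict))
    return rows


if __name__ == "__main__":
    for name, ratio, delta, verdict in report(TIMINGS):
        print(f"{name}: ratio {ratio}, delta {delta:+}s, {verdict}")
```

On these numbers the 1 gb checkout comes out at a ratio of 2.0, matching the "about twice as slow" remark, while the gap for `git-annex get` grows from 1.3s with one 1 gb file to 4.4s with two. These are single runs on one machine with disk cache in play, so the ratios are indicative at best.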