diff --git a/doc/todo/git_smudge_clean_interface_suboptiomal.mdwn b/doc/todo/git_smudge_clean_interface_suboptiomal.mdwn
index c807659dcc..b7b92d95b8 100644
--- a/doc/todo/git_smudge_clean_interface_suboptiomal.mdwn
+++ b/doc/todo/git_smudge_clean_interface_suboptiomal.mdwn
@@ -120,6 +120,9 @@ The best fix would be to improve git's smudge/clean interface:
 
 ## benchmarking
 
+Goal is to both characterise how slow this interface makes git-annex,
+and to investigate when enabling filter-process is an improvement, and when it is not.
+
 * git add of 1000 small files (adding to git repository not annex)
   - no git-annex: 0.2s
   - git-annex with smudge --clean: 63.3s
@@ -146,10 +149,63 @@ The best fix would be to improve git's smudge/clean interface:
   the piping to add more overhead than it seems to have.
 * git checkout of branch with 1000 small annexed files
   - no git-annex (checking out annex pointer files): 0.1s
-  - git-annex with smudge: 83.4s
-  - git-annex with filter-process: 16.0s ()
-  With filter-process, the actual checkout takes under a second,
-  then the post-checkout hook which populates the annexed files
-  and restages them in git. The restaging does not
-  use filter-process currently. The number in parens is with
-  git-annex modified so the restaging does use filter-process.
+  - git-annex with smudge: 145s
+  - git-annex with filter-process enabled: 13.1s
+  Win for filter-process, but small annexed files are somewhat
+  unusual. See next benchmark.
+* git checkout of branch with 1 gb annexed file
+  - git-annex with smudge: 5.6s
+  - git-annex with filter-process enabled: 11.2s
+  Here filter-process slows it down, and the reason it does
+  is that the post-checkout hook runs, which populates the annexed file
+  and restages it in git. The restaging uses filter-process, and git
+  feeds the annexed file contents through the pipe, though git-annex
+  does not need to see that data. So it makes sense that
+  filter-process is about twice as slow as smudge, since with smudge
+  it only has to write the file and does not also read it.
+  With more annexed data being checked out, it should continue to
+  scale like this, with filter-process being 2x as expensive,
+  or perhaps more (if disk cache stops helping).
+  Disabling filter-process during the restaging would improve
+  this case, but unfortunately it does not seem easy to do
+  that (see [[!commit 837025b14f523f9180f82d0cced1e53a8a9b94de]]).
+* git-annex get of 1000 small annexed files
+  - git-annex with smudge: 100.1s
+  - git-annex with filter-process enabled: 39.3s
+  The difference is due to restaging in git needing to pass
+  the annexed files through the filter.
+  Win for filter-process, but small annexed files are somewhat
+  unusual. See next benchmark.
+* git-annex get of a 1 gb annexed file
+  - git-annex with smudge: 21.5s
+  - git-annex with filter-process enabled: 22.8s
+  Transfer time was around 12s, the rest is copying the file
+  to the work tree and restaging overhead. So filter-process
+  is slower because git sends the file content to it over a pipe
+  unnecessarily. Less of a loss for filter-process than I expected
+  though, but again disk cache probably helped.
+* git-annex get of two 1 gb annexed files
+  - git-annex with smudge: 42.3s
+  - git-annex with filter-process enabled: 46.7s
+  This shows that filter-process will get progressively worse
+  as the amount of annexed data that git-annex gets goes up.
+  It is not a fast increase, but it will add up. Also disk cache
+  will stop helping at some point.
+
+Benchmark summary:
+
+* filter-process makes `git add` slightly slower for large
+  files that are added to the annex, but not as much as expected (and it can
+  be improved), so overall it's a win for `git add`.
+
+* filter-process makes `git checkout`, `merge`, etc of unlocked annexed files
+  at least twice as slow as the size of annexed data goes up, but it does avoid
+  very slow checkouts when there are a lot of non-annexed or smaller unlocked
+  annexed files. That benefit may be worth the overhead, though it would
+  be good to check the overhead with larger annexed data checkouts to see
+  how it scales.
+
+* filter-process makes `git-annex get` slower as the size of annexed data
+  goes up. Although the time spent actually getting the data will typically
+  dominate (network being slower than disk), so this may be an acceptable
+  tradeoff for many users.
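
The arithmetic behind the benchmark summary can be checked with a small script. This is only a sketch and not part of the patch: the timings are copied from the benchmark list above, while the names `BENCHMARKS`, `compare` and `report` are hypothetical helpers introduced here for illustration.

```python
# Timings in seconds, copied from the benchmark list in the patch.
# Each entry: (benchmark, smudge seconds, filter-process seconds).
# BENCHMARKS, compare and report are illustrative names, not git-annex code.
BENCHMARKS = [
    ("git checkout, 1000 small annexed files", 145.0, 13.1),
    ("git checkout, 1 gb annexed file", 5.6, 11.2),
    ("git-annex get, 1000 small annexed files", 100.1, 39.3),
    ("git-annex get, 1 gb annexed file", 21.5, 22.8),
    ("git-annex get, two 1 gb annexed files", 42.3, 46.7),
]


def compare(smudge, filter_process):
    """Return (ratio, delta): filter-process time relative to smudge time."""
    ratio = filter_process / smudge   # < 1 means filter-process is faster
    delta = filter_process - smudge   # extra seconds spent with filter-process
    return ratio, delta


def report(benchmarks=BENCHMARKS):
    """Format one comparison line per benchmark."""
    lines = []
    for name, smudge, fp in benchmarks:
        ratio, delta = compare(smudge, fp)
        verdict = "filter-process faster" if ratio < 1 else "filter-process slower"
        lines.append(f"{name}: ratio {ratio:.2f}, delta {delta:+.1f}s ({verdict})")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

On these numbers, the 1 gb checkout comes out at a ratio of 2.0, matching the "about twice as slow" observation, and the extra time for `git-annex get` grows from roughly 1.3s for one 1 gb file to roughly 4.4s for two, which is the "progressively worse" trend noted in the patch.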