2013-11-06 21:38:24 +00:00 · 2013-11-06 21:38:24 +00:00 · 280a3204a2
commit 280a3204a2
parent 2c37b6ca1d
1 changed files with 45 additions and 0 deletions
--- a/doc/forum/_Does_git_annex_find_4038_friends41_batch_queries_to_the_location_log63.mdwn
+++ b/doc/forum/_Does_git_annex_find_4038_friends41_batch_queries_to_the_location_log63.mdwn
@ -0,0 +1,45 @@
+Hi,
+
+Some time ago I asked [[here|git_annex_copy_--fast_--to_blah_much_slower_than_--from_blah]]
+about possible improvements in git `copy --fast --to`, since it was painfully slow
+on moderately large repos.
+
+Now I found a way to make it much faster for my particular use case, by
+accessing some annex internals. And I realized that maybe commands like `git
+annex find --in=repo` do not batch queries to the location log. This is based on
+the following timings, on a new repo (only a few commits) and about 30k files.
+
+    > time git annex find --in=skynet > /dev/null
+
+    real	0m55.838s
+    user	0m30.000s
+    sys	0m1.583s
+
+
+    > time git ls-tree -r git-annex | cut -d ' ' -f 3 | cut -f 1 | git cat-file --batch > /dev/null
+
+    real	0m0.334s
+    user	0m0.517s
+    sys	0m0.030s
+
+Those numbers are on linux (with an already warm file cache) and an ext4 filesystem on a SSD.
+
+The second command above is feeding a list of objects to a single `git cat-file`
+process that cats them all to stdout, preceeding every file dump by the object
+being cat-ed. It is a trivial matter to parse this output and use it for
+whatever annex needs.
+
+Above I wrote a `git ls-tree` on the git-annex branch for simplicity, but we could
+just as well do a `ls-tree | ... | git cat-file` on HEAD to get the keys for the
+annexed files matching some path and then feed those keys to a cat-file on
+the git-annex branch. And this still would be an order of magnitude faster than
+what currently annex seems to do.
+
+I'm assuming the bottleneck is in that annex does not batch the `cat-file`, as the rest of logic needed for a find will be fast. Is that right? 
+
+Now, if the queries to the location log for `copy --to` and `find` could be batched this way, the
+performance of several useful things to do, like checking how many annexed files
+we are missing, would be bastly improved. Hell, I could even put that number on
+the command line prompt!
+
+I'm not yet very fluent in Haskell, but I'm willing to help if this is something that makes sense and can be done.