comment

2016-09-14 12:18:48 -04:00 · 2016-09-14 12:18:48 -04:00 · 916f05e890
commit 916f05e890
parent 021ebc651d
1 changed files with 58 additions and 0 deletions
--- a/doc/todo/make_copy_--fast__faster/comment_7_3f52b6e19035d3c891356c6d98035987._comment
+++ b/doc/todo/make_copy_--fast__faster/comment_7_3f52b6e19035d3c891356c6d98035987._comment
@@ -0,0 +1,58 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 7"""
+ date="2016-09-14T15:28:23Z"
+ content="""
+First, note that git-annex 6.20160619 sped up the git-annex 
+command startup time significantly. Please be sure to use a current
+version in benchmarks, and state the version.
+
+`git archive` (and `git cat-file --batch --batch-all-objects`) are just
+reading packs and loose objects in disk order and dumping out the contents.
+`git cat-file --batch` has to look up objects in the pack index files, seek
+in the pack, etc. It's not a fair comparison.
+
+Note that `git annex find`, when used without options like `--in` or `--copies`,
+does not need to read anything from `git cat-file` at all. The
+`GIT_TRACE_PERFORMANCE` you show is misleading; it's just showing how long
+the git command is left running, idle.
+
+`git annex find`'s overhead should be purely traversing the filesystem tree
+and checking which symlinks point to existing files. You can write programs that do
+the same thing without using git at all (or only `git ls-files`), and
+compare them to git-annex's time; that would be a fairer comparison.
+Ideally, `git annex find` would be entirely system call bound and would use
+very little CPU itself.
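As a rough baseline for that comparison, here is a minimal sketch of such a program. It assumes annexed files are locked (symlinks whose targets contain `.git/annex/objects/`), and it walks the working tree directly instead of asking `git ls-files`, so it may also report untracked symlinks. The function name `find_present_annexed` is made up for illustration.

```python
import os


def find_present_annexed(top):
    # Walk the working tree under `top` and yield the relative paths of
    # symlinks that point into an annex objects directory and whose
    # target actually exists (i.e. the content is present).
    for dirpath, dirnames, filenames in os.walk(top):
        # Do not descend into the git directory itself.
        if ".git" in dirnames:
            dirnames.remove(".git")
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            target = os.readlink(path).replace(os.sep, "/")
            if ".git/annex/objects/" not in target:
                continue
            # os.path.exists follows the symlink, so it is False for a
            # dangling link (content not present).
            if os.path.exists(path):
                yield os.path.relpath(path, top)
```

Timing a loop over `find_present_annexed(".")` against `git annex find` in the same repository gives a first idea of how much of the run time is plain filesystem traversal.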
+
+By contrast, `git annex copy` makes significant use of `git cat-file --batch`,
+since it needs to look up location log information to see if the
+`--to`/`--from` remote has the files.
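To make the cost of such a lookup concrete, here is a minimal sketch of one round trip to a long-running `git cat-file --batch` process. It is not git-annex's implementation (which is in Haskell). The helper names `read_batch_reply`, `open_batch` and `query` are made up for illustration, and the sketch only handles the two reply shapes described in the `git cat-file` documentation: `<oid> <type> <size>` followed by the contents and a newline, or `<object> missing`.

```python
import subprocess


def read_batch_reply(stream):
    # Read one reply from the stdout of `git cat-file --batch`.
    # Returns (type, content), or None if the object is missing.
    header = stream.readline().rstrip(b"\n")
    fields = header.split(b" ")
    if fields[-1] == b"missing":
        return None
    objtype, size = fields[1], int(fields[2])
    content = stream.read(size)
    stream.read(1)  # consume the newline that follows the contents
    return objtype.decode(), content


def open_batch(repo):
    # Start one long-running batch process for the repository at `repo`.
    return subprocess.Popen(
        ["git", "-C", repo, "cat-file", "--batch"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )


def query(proc, objname):
    # Send one object name and wait for its reply. Each query is a full
    # write/flush/read round trip through the pipe.
    proc.stdin.write(objname.encode() + b"\n")
    proc.stdin.flush()
    return read_batch_reply(proc.stdout)
```

A location log lookup would then look something like `query(proc, "git-annex:<path of the key's log file>")`, where the exact path inside the `git-annex` branch is left as a placeholder here.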
+
+`git annex copy -J` already parallelizes the parts of the code that look at
+the location log, including spinning up a separate `git cat-file --batch`
+process for each thread, so they won't contend on such queries. So I
+would expect that to make it faster, even leaving aside the speed benefits
+of doing the actual copies in parallel.
+
+My feeling is that the best way to speed these up is going to be in one 
+of these classes:
+
+* It's possible that `git cat-file --batch` is somehow slower than it needs
+  to be. Perhaps it's not doing good caching between queries or has
+  inefficient serialization or bad stdio buffering. It might just be the case
+  that using something like libgit2 instead would be faster.
+  (Due to libgit2's poor interface stability, it would have to be an
+  optional build flag.)
+
+* Many small optimisations to the code. The use of Strings throughout
+  git-annex could well be a source of systematic small inefficiencies,
+  and using ByteString might eliminate those. (But this would be a huge job.)
+  (The `git cat-file --batch` communication is already done using
+  bytestrings.)
+
+* A completely lateral move. For example, if git-annex kept its own
+  database recording which files are present, then `git annex find`
+  could do a simple database query and not need to chase all the symlinks.
+  But such a database needs to somehow be kept in sync or reconciled
+  with the git index, which is not an easy thing.
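The database idea in the last bullet could be sketched as follows. This is purely hypothetical: the table name, columns and helper functions are invented for illustration, and the hard part named above (keeping the table in sync with the git index) is not addressed at all.

```python
import sqlite3


def open_db(path=":memory:"):
    # One row per work tree file whose content is present locally.
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS present "
        "(file TEXT PRIMARY KEY, key TEXT NOT NULL)"
    )
    return db


def record_present(db, file, key):
    # Called whenever content for `file` arrives (e.g. after a get).
    db.execute(
        "INSERT OR REPLACE INTO present (file, key) VALUES (?, ?)",
        (file, key),
    )


def record_dropped(db, file):
    # Called whenever content for `file` is removed (e.g. after a drop).
    db.execute("DELETE FROM present WHERE file = ?", (file,))


def find_present(db):
    # The equivalent of a plain `git annex find`: a single query,
    # with no symlink chasing.
    rows = db.execute("SELECT file FROM present ORDER BY file")
    return [row[0] for row in rows]
```

With this shape, the cost of listing present files depends on the number of rows returned rather than on the size of the work tree, which is the point of the lateral move.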
+"""]]