Joey Hess 2016-09-14 12:18:48 -04:00
parent 021ebc651d
commit 916f05e890
GPG key ID: C910D9222512E3C7

@@ -0,0 +1,58 @@
[[!comment format=mdwn
username="joey"
subject="""comment 7"""
date="2016-09-14T15:28:23Z"
content="""
First, note that git-annex 6.20160619 sped up the git-annex
command startup time significantly. Please be sure to use a current
version in benchmarks, and state the version.

`git archive` (and `git cat-file --batch --batch-all-objects`) are just
reading packs and loose objects in disk order and dumping out the contents.
`git cat-file --batch` has to look up objects in the pack index files, seek
in the pack, etc. It's not a fair comparison.

Note that `git annex find`, when used without options like --in or --copies,
does not need to read anything from `git cat-file` at all. The
`GIT_TRACE_PERFORMANCE` you show is misleading; it's just showing how long
the git command is left running, idle.

`git annex find`'s overhead should be purely traversing the filesystem tree
and checking what symlinks point to files. You can write programs that do
the same thing without using git at all (or only `git ls-files`), and
compare them to git-annex's time; that would be a fairer comparison.
Ideally, `git annex find` would be entirely system call bound and would use
very little CPU itself.

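
As a rough illustration of such a comparison program, here is a minimal
Python sketch that walks a tree without running git at all and yields the
symlinks whose targets exist. It assumes annexed files are symlinks whose
target contains `.git/annex/objects/`; that marker, and the name
`find_present`, are assumptions made for this example and not anything taken
from git-annex's code.

```python
import os

# Assumption: annexed files are symlinks whose target contains this path.
ANNEX_MARKER = ".git/annex/objects/"

def find_present(root):
    # Yield symlinks under root that point into the annex and whose
    # target exists, roughly what a bare `git annex find` would list.
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into the .git directory itself.
        if ".git" in dirnames:
            dirnames.remove(".git")
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            if ANNEX_MARKER not in os.readlink(path):
                continue
            # os.path.exists follows the symlink, so a dangling link
            # (content not present) is skipped.
            if os.path.exists(path):
                yield path
```

Timing something like `for p in find_present("."): print(p)` against
`git annex find` on the same tree gives a baseline for the pure
filesystem cost.
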
By contrast, `git annex copy` makes significant use of `git cat-file --batch`,
since it needs to look up location log information to see if the
--to/--from remote has the files.
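
For anyone unfamiliar with it, the exchange with a long-running
`git cat-file --batch` looks roughly like the Python sketch below: one
object name is written per line, and git answers with a
`<sha> <type> <size>` header, the content, and a trailing newline, or with
`<name> missing`. The `BatchCatFile` class and `read_reply` function are
illustrative names only; git-annex is written in Haskell and this is not
its implementation.

```python
import subprocess

def read_reply(stream):
    # Parse one reply from `git cat-file --batch`.
    # Returns (sha, type, content), or None if the object is missing.
    header = stream.readline().rstrip(b"\n").split(b" ")
    if header[-1] == b"missing":
        return None
    sha, objtype, size = header
    content = stream.read(int(size))
    stream.read(1)  # consume the newline that follows the content
    return sha.decode(), objtype.decode(), content

class BatchCatFile:
    # One long-running git process that answers many queries.
    def __init__(self, repo="."):
        self.proc = subprocess.Popen(
            ["git", "-C", repo, "cat-file", "--batch"],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE)

    def get(self, objname):
        self.proc.stdin.write(objname.encode() + b"\n")
        self.proc.stdin.flush()  # without the flush, git never sees the query
        return read_reply(self.proc.stdout)

    def close(self):
        self.proc.stdin.close()
        self.proc.wait()
```

Each `get` call is a separate round trip through the pipe plus an index
lookup on git's side, which is where the per-query cost comes from.
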

`git annex copy -J` already parallelizes the parts of the code that look at
the location log, including spinning up a separate `git cat-file --batch`
process for each thread, so they won't contend on such queries. So I
would expect that to make it faster, even leaving aside the speed benefits
of doing the actual copies in parallel.
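
The per-thread arrangement can be sketched with thread-local storage: each
worker thread lazily starts its own helper the first time it needs one, so
a lookup never waits on another thread's helper. This is a generic Python
illustration, not git-annex's code; `make_helper` stands in for whatever
starts a `git cat-file --batch` process, and all names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class PerThread:
    # Hands each thread its own helper, created on first use.
    def __init__(self, make_helper):
        self._make_helper = make_helper
        self._local = threading.local()
        self._lock = threading.Lock()
        self.helpers = []  # every helper created so far, for cleanup

    def get(self):
        helper = getattr(self._local, "helper", None)
        if helper is None:
            helper = self._make_helper()
            self._local.helper = helper
            with self._lock:  # only taken once per thread, at creation
                self.helpers.append(helper)
        return helper

def lookup_all(keys, make_helper, query, jobs=4):
    # Run query(helper, key) for every key using `jobs` threads,
    # each thread talking only to its own helper.
    per_thread = PerThread(make_helper)
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        results = list(pool.map(lambda k: query(per_thread.get(), k), keys))
    return results, per_thread.helpers
```

The returned list of helpers is there so the caller can shut each one down
once all lookups are done.
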

My feeling is that the best way to speed these up is going to be in one
of these classes:

* It's possible that `git cat-file --batch` is somehow slower than it needs
  to be. Perhaps it's not doing good caching between queries or has
  inefficient serialization/bad stdio buffering. It might just be the case
  that using something like libgit2 instead would be faster.
  (Due to libgit2's poor interface stability, it would have to be an
  optional build flag.)

* Many small optimisations to the code. The use of Strings throughout
  git-annex could well be a source of systematic small inefficiencies,
  and using ByteString might eliminate those. (But this would be a huge job.)
  (The `git cat-file --batch` communication is already done using
  bytestrings.)

* A completely lateral move. For example, if git-annex kept its own
  database recording which files are present, then `git annex find`
  could do a simple database query and not need to chase all the symlinks.
  But such a database needs to somehow be kept in sync or reconciled
  with the git index, which is not an easy thing.
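
To make the database idea a little more concrete, here is a minimal sketch
using Python's `sqlite3` module. The `present` table, its columns, and the
function names are all invented for illustration, and the hard part, keeping
the table in sync with the git index, is not addressed at all.

```python
import sqlite3

def open_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS present ("
        " file TEXT PRIMARY KEY,"
        " annex_key TEXT NOT NULL)")
    return db

def mark_present(db, file, annex_key):
    # Record that the content for `file` is in the local annex.
    db.execute("INSERT OR REPLACE INTO present VALUES (?, ?)",
               (file, annex_key))

def mark_absent(db, file):
    # Forget a file whose content has been dropped.
    db.execute("DELETE FROM present WHERE file = ?", (file,))

def query_present(db):
    # One query over the table, with no symlink chasing.
    rows = db.execute("SELECT file FROM present ORDER BY file")
    return [r[0] for r in rows]
```

Every operation that adds or drops content would have to call
`mark_present` or `mark_absent`, and anything that changes the index behind
git-annex's back would leave the table stale.
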
"""]]