comment
This commit is contained in:
parent
021ebc651d
commit
916f05e890
1 changed files with 58 additions and 0 deletions
@ -0,0 +1,58 @@
[[!comment format=mdwn
username="joey"
subject="""comment 7"""
date="2016-09-14T15:28:23Z"
content="""
First, note that git-annex 6.20160619 sped up the git-annex
command startup time significantly. Please be sure to use a current
version in benchmarks, and state the version.

`git archive` (and `git cat-file --batch --batch-all-objects`) are just
reading packs and loose objects in disk order and dumping out the contents.
`git cat-file --batch` has to look up objects in the pack index files, seek
in the pack, etc. It's not a fair comparison.

Note that `git annex find`, when used without options like `--in` or `--copies`,
does not need to read anything from `git cat-file` at all. The
`GIT_TRACE_PERFORMANCE` you show is misleading; it's just showing how long
the git command is left running, idle.

`git annex find`'s overhead should be purely traversing the filesystem tree
and checking what symlinks point to files. You can write programs that do
the same thing without using git at all (or only `git ls-files`), and
compare them to git-annex's time; that would be a fairer comparison.
Ideally, `git annex find` would be entirely system call bound and would use
very little CPU itself.
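
A rough sketch of such a comparison program, without using git at all, might look like the following. It is not how git-annex is implemented; it assumes annexed files are symlinks whose targets contain `.git/annex/objects/`, and the function name `find_present_annexed` is made up for illustration.

```python
import os


def find_present_annexed(root):
    # Walk the working tree, skipping the .git directory itself.
    for dirpath, dirnames, filenames in os.walk(root):
        if ".git" in dirnames:
            dirnames.remove(".git")
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            # Assumption: annexed symlinks point into .git/annex/objects/.
            if ".git/annex/objects/" not in os.readlink(path):
                continue
            # os.path.exists follows the symlink, so this is False
            # when the content is not present locally.
            if os.path.exists(path):
                yield os.path.relpath(path, root)


if __name__ == "__main__":
    for present in find_present_annexed("."):
        print(present)
```

Timing this against `git annex find` on the same tree would give a baseline for how much of the run time is plain filesystem traversal.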

By contrast, `git annex copy` makes significant use of `git cat-file --batch`,
since it needs to look up location log information to see if the
`--to`/`--from` remote has the files.

`git annex copy -J` already parallelizes the parts of the code that look at
the location log. This includes spinning up a separate `git cat-file --batch`
process for each thread, so they won't contend on such queries. So I
would expect that to make it faster, even leaving aside the speed benefits
of doing the actual copies in parallel.
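
The one-process-per-thread pattern can be sketched roughly as below. This is an illustration only, not git-annex's own (Haskell) code: the helper names `parse_header`, `batch_process`, `read_object` and `read_objects` are hypothetical, and the sketch assumes the usual `git cat-file --batch` reply of a `<sha> <type> <size>` header line followed by the content and a newline, or a line ending in `missing`.

```python
import subprocess
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()  # holds one cat-file process per worker thread


def parse_header(line):
    # A hit looks like b"<sha> <type> <size>\n"; a miss ends in b"missing".
    parts = line.rstrip(b"\n").split(b" ")
    if parts[-1] == b"missing":
        return None
    sha, objtype, size = parts
    return sha.decode(), objtype.decode(), int(size)


def batch_process(repo):
    # Lazily start this thread's own `git cat-file --batch`.
    proc = getattr(_local, "proc", None)
    if proc is None:
        proc = subprocess.Popen(
            ["git", "-C", repo, "cat-file", "--batch"],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        _local.proc = proc
    return proc


def read_object(repo, name):
    # Send one object name and read back its content, or None if missing.
    proc = batch_process(repo)
    proc.stdin.write(name.encode() + b"\n")
    proc.stdin.flush()
    header = parse_header(proc.stdout.readline())
    if header is None:
        return None
    size = header[2]
    content = proc.stdout.read(size)
    proc.stdout.read(1)  # consume the newline that follows the content
    return content


def read_objects(repo, names, jobs=4):
    # Each pool thread talks only to its own process, so queries don't contend.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(lambda n: read_object(repo, n), names))
```

The sketch leaves the helper processes running until the program exits; a longer-lived program would want to close their stdin and wait for them.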

My feeling is that the best way to speed these up is going to be in one
of these classes:

* It's possible that `git cat-file --batch` is somehow slower than it needs
  to be. Perhaps it's not doing good caching between queries or has
  inefficient serialization/bad stdio buffering. It might just be the case
  that using something like libgit2 instead would be faster.
  (Due to libgit2's poor interface stability, it would have to be an
  optional build flag.)

* Many small optimisations to the code. The use of Strings throughout
  git-annex could well be a source of systematic small inefficiencies,
  and using ByteString might eliminate those. (But this would be a huge job.)
  (The `git cat-file --batch` communication is already done using
  ByteStrings.)

* A completely lateral move. For example, if git-annex kept its own
  database recording which files are present, then `git annex find`
  could do a simple database query and not need to chase all the symlinks.
  But such a database needs to somehow be kept in sync or reconciled
  with the git index; that's not an easy thing.
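
To make the database idea in the last item concrete, here is a minimal sketch using SQLite. The schema, table name and function names are all hypothetical, and it deliberately ignores the hard part, which is keeping the table reconciled with the git index.

```python
import sqlite3


def open_db(path=":memory:"):
    # Hypothetical schema: one row per work tree file whose content is present.
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS present "
        "(file TEXT PRIMARY KEY, key TEXT NOT NULL)")
    return db


def record_present(db, file, key):
    # Called whenever content for a file arrives in the local annex.
    db.execute(
        "INSERT OR REPLACE INTO present (file, key) VALUES (?, ?)",
        (file, key))


def record_dropped(db, file):
    # Called whenever content for a file is dropped.
    db.execute("DELETE FROM present WHERE file = ?", (file,))


def find_present(db):
    # The equivalent of a plain `git annex find`: one query, no symlink chasing.
    return [row[0] for row in db.execute(
        "SELECT file FROM present ORDER BY file")]
```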
"""]]