Testing b9ac585454, it didn't find the
optimal union merge, the second sha was the one to use, at least in
the case I tried. Let's just try all shas to see if any can be reused.
I stopped using the expensive nub, so despite the use of sets to
sort/uniq file contents, this is probably as fast or faster than it
was before.
Tries to avoid generating a new object when the merged content has the same
lines that were in the old object.
I've noticed some merge commits that only move lines around, like this:
- 1323478057.181191s 1 be23c3ac-0ee5-11e0-b185-3b0f9b5b00c5
1323204972.062151s 1 87e06c7a-7388-11e0-ba07-03cdf300bd87
++1323478057.181191s 1 be23c3ac-0ee5-11e0-b185-3b0f9b5b00c5
Unsure if this will really save anything in practice, since it only looks
at one of the two old objects, and maybe I didn't pick the best one.
In git, a Ref can be a Sha, or a Branch, or a Tag. I added type aliases for
those. Note that this does not prevent mixing up of eg, refs and branches
at the type level. Since git really doesn't care, except rare cases like
git update-ref, or git tag -d, that seems ok for now.
There's also a tree-ish, but let's just use Ref for it. A given Sha or Ref
may or may not be a tree-ish, depending on the object type, so there seems
no point in trying to represent it at the type level.
Before, a merge was first calculated, by running various actions that
called git and built up a list of lines, which were at the end sent
to git update-index. This necessarily used space proportional to the size
of the diff between the trees being merged.
Now, lines are streamed into git update-index from each of the actions in
turn.
Runtime size of git-annex merge when merging 50000 location log files
drops from around 100 mb to a constant 4 mb.
Presumably it runs quite a lot faster, too.
This reduces the memory use of a merge by 1/3rd. The space leak was
apparently because the whole update-index input was generated strictly, not
lazily.
I wondered if the change to ByteStrings contributed to this, due to the
need to convert with L.pack here. But going back to the old code, I still
see a much similar leak, and worse performance besides due to it not using
ByteStrings.
The fix is to just hPutStr the lines repeatedly. (Note the \0 is written
separately, to avoid allocation overheads in adding it to the string.)
The Git.pipeWrite interface is probably just wrong for any large inputs to
git. This was the only place using it for input of any size.
There is still at least one other space leak in the merge code.
Many functions took the repo as their first parameter. Changing it
consistently to be the last parameter allows doing some useful things with
currying, that reduce boilerplate.
In particular, g <- gitRepo is almost never needed now, instead
use inRepo to run an IO action in the repo, and fromRepo to get
a value from the repo.
This also provides more opportunities to use monadic and applicative
combinators.
This yields a second or so speedup in unused, find, etc. Seems that even
when the ByteString is immediately split and then converted to Strings,
it's faster.
I may try to push ByteStrings out into more of git-annex gradually,
although I suspect most of the time-critical parts are already covered
now, and many of the rest rely on libraries that only support Strings.