This will speed up typical cases like git-annex get, which currently
has to read the location log once, then read it a second time in order to
add a line to it. Since these reads now involve more than just reading
in a file, it seemed good to add a cache layer.
Only the most recent thing needs to be cached, because git-annex has
good locality; it operates on one file at a time, and only cares
about one item from the branch per file.
This is needed for robust handling of the git-annex branch. Since changes
are staged to its index as git-annex runs, and committed at the end,
it's possible that git-annex is interrupted, and leaves a dirty index.
When it next runs, it needs to be able to merge the git-annex branch
as necessary, without losing the existing changes in the index.
Note that this assumes that the git-annex branch is only modified by
git-annex. Any changes to it will be lost when git-annex updates the
branch. I don't see a good, inexpensive way to find changes in
the git-annex branch that arn't in the index, and union merging the
git-annex branch into the index every time would likewise be expensive.
There is no suitable git hook to run code when pulling changes that
might need to be merged into the git-annex branch. The post-merge hook
is only run when changes are merged into HEAD, and it's possible,
and indeed likely that many pulls will only have changes in git-annex,
but not in HEAD, and not trigger it.
So, git-annex will have to take care to update the branch before reading
from it, to make sure it has merged in current info from remotes. Happily,
this can be done quite inexpensively, just a git-show-ref to list
branches, and a minimalized git-log to see if there are unmerged changes
on the branches. To further speed up, it will be done only once per
git-annex run, max.
Avoided the slow git add, instead inject content directly into git and
populate the index all in one pass. Now this runs on my large real-world
repo in 10 seconds, which is acceptable.
Also lots of code cleanups.
This is a new git subcommand, that does a generic union merge operation
between two refs, storing the result in a branch. It operates efficiently
without touching the working tree. It does need to write out a temporary
index file, and may need to write out some other temp files as well.
This could be useful for anything that stores data in a branch,
and needs to merge changes into that branch without actually checking the
branch out. Since conflict handling can't be done without a working copy,
the merge type is always a union merge, which is fine for data stored in
log format (as git-annex does), or in non-conflicting files
(as pristine-tar does).
This probably belongs in git proper, but it will live in git-annex for now.
---
Plan is to move .git-annex/ to a git-annex branch, and use git-union-merge
to handle merging changes when pulling from remotes.
Some preliminary benchmarking using real .git-annex/ data indicates
that it's quite fast, except for the "git add" call, which is as slow
as "git add" tends to be with a big index.
cp is still used when copying file from repos on the same filesystem, since
--reflink=auto can make it significantly faster on filesystems such as
btrfs.
Directory special remotes still use cp, not rsync. It's not clear what
tmp file should be used when rsyncing to such a remote.
get not honoring --from has surprised me a few times, so least surprise
suggests it should just behave like copy --from. This leaves the difference
between get and copy being that copy always requires the remote to copy
from, while get will decide whether to get a file from a key/value store or
a remote.