I had to, I hope temporarily, lose my nice Annex newtype, and use a type
synonym. This because I cannot find a way to derive a MonadBaseControl
instance of the Annex newtype. I've emailed Bas van Dijk in hope he can
help get the newtype back.
Otherwise appears to build & work.
Supporting multiple directory hash types will allow converting to a
different one, without a flag day.
gitAnnexLocation now checks which of the possible locations have a file.
This means more statting of files. Several places currently use
gitAnnexLocation and immediately check if the returned file exists;
those need to be optimised.
The only fully supported thing is to have the main repository on one disk,
and .git/annex on another. Only commands that move data in/out of the annex
will need to copy it across devices.
There is only partial support for putting arbitrary subdirectories of
.git/annex on different devices. For one thing, but this can require more
copies to be done. For example, when .git/annex/tmp is on one device, and
.git/annex/journal on another, every journal write involves a call to
mv(1). Also, there are a few places that make hard links between various
subdirectories of .git/annex with createLink, that are not handled.
In the common case without cross-device, the new moveFile is actually
faster than renameFile, avoiding an unncessary stat to check that a file
(not a directory) is being moved. Of course if a cross-device move is
needed, it is as slow as mv(1) of the data.
The branch may not exist, if .git/annex has been copied over from another
repo (or a corrupted repo). I suppose it could also have gotten deleted
somehow. Without this, there is a confusing failure.
In git, a Ref can be a Sha, or a Branch, or a Tag. I added type aliases for
those. Note that this does not prevent mixing up of eg, refs and branches
at the type level. Since git really doesn't care, except rare cases like
git update-ref, or git tag -d, that seems ok for now.
There's also a tree-ish, but let's just use Ref for it. A given Sha or Ref
may or may not be a tree-ish, depending on the object type, so there seems
no point in trying to represent it at the type level.
Before, a merge was first calculated, by running various actions that
called git and built up a list of lines, which were at the end sent
to git update-index. This necessarily used space proportional to the size
of the diff between the trees being merged.
Now, lines are streamed into git update-index from each of the actions in
turn.
Runtime size of git-annex merge when merging 50000 location log files
drops from around 100 mb to a constant 4 mb.
Presumably it runs quite a lot faster, too.
Avoids doing auto-merging in commands that don't need fully current
information from the git-annex branch. In particular, git annex add no
longer needs to auto-merge. Affected commands: Anything that doesn't
look up data from the branch, but does write a change to it.
It might seem counterintuitive that we can change a value without first
making sure we have the current value. This optimisation works because
these two sequences are equivilant:
1. pull from remote
2. union merge
3. read file from branch
4. modify file and write to branch
vs.
1. read file from branch
2. modify file and write to branch
3. pull from remote
4. union merge
After either sequence, the git-annex branch contains the same logical content
for the modified file. (Possibly with lines in a different order or
additional old lines of course).
git-annex-shell inannex now returns always 0, 1, or 100 (the last when
it's unclear if content is currently in the index due to it currently being
moved or dropped).
(Actual locking code still not yet written.)
The lock will only persist during the perform stage, so the content must
be removed from the annex then, rather than in the cleanup stage.
(No lock is actually taken yet.)
Many functions took the repo as their first parameter. Changing it
consistently to be the last parameter allows doing some useful things with
currying, that reduce boilerplate.
In particular, g <- gitRepo is almost never needed now, instead
use inRepo to run an IO action in the repo, and fromRepo to get
a value from the repo.
This also provides more opportunities to use monadic and applicative
combinators.
Avoid ever using read to parse a non-haskell formatted input string.
show :: Key is arguably still show abuse, but displaying Keys as filenames
is just too useful to give up.
Thanks Valentin Haenel for a test case showing how non-fast-forward merges
could result in an ongoing pull/merge/push cycle.
While the git-annex branch is fast-forwarded, git-annex's index file is still
updated using the union merge strategy as before. There's no other way to
update the index that would be any faster.
It is possible that a union merge and a fast-forward result in different file
contents: Files should have the same lines, but a union merge may change
their order. If this happens, the next commit made to the git-annex branch
will have some unnecessary changes to line orders, but the consistency
of data should be preserved.
Note that when the journal contains changes, a fast-forward is never attempted,
which is fine, because committing those changes would be vanishingly unlikely
to leave the git-annex branch at a commit that already exists in one of
the remotes.
The real difficulty is handling the case where multiple remotes have all
changed. git-annex does find the best (ie, newest) one and fast forwards
to it. If the remotes are diverged, no fast-forward is done at all. It would
be possible to pick one, fast forward to it, and make a merge commit to
the rest, I see no benefit to adding that complexity.
Determining the best of N changed remotes requires N*2+1 calls to git-log, but
these are fast git-log calls, and N is typically small. Also, typically
some or all of the remote refs will be the same, and git-log is not called to
compare those. In the real world I expect this will almost always add only
1 git-log call to the merge process. (Which already makes N anyway.)
Specifically, disabled trying to update the git-annex branch on the remote,
since that data is never used by operations that act on such remotes.
Also, when copying content to such a remote, skip committing the presence
information changes to its git-annex branch. Leaving it in the journal there
is ok: Any command run on the remote that needs the info will flush the
journal.
This may partially solve this bug:
http://git-annex.branchable.com/bugs/fails_to_handle_lot_of_files/
Although I still see unreaped git processes piling up when doing a copy --to.
Another process may stage journalled files before the lock is
taken, so need to get the list of journalled files afterwards.
It's unfortunate this means getting the directory contents twice,
but it seems better to do that than sometimes take the lock
unnecessarily.