Now it reads the size specified, rather than using the sentinel hack to
determine EOF.
It still depends on error messages to handle files that are not present.
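Reading by size works roughly like this (a sketch only, with illustrative
handle and function names, assuming a pipe to a long-running
git cat-file --batch):

    import System.IO
    import Data.Char (isDigit)
    import qualified Data.ByteString as B

    -- git cat-file --batch prints a header line of "<sha> <type> <size>",
    -- then exactly <size> bytes of content and a trailing newline; missing
    -- objects get a "<object> missing" line instead.
    catObject :: Handle -> Handle -> String -> IO (Maybe B.ByteString)
    catObject toGit fromGit object = do
        hPutStrLn toGit object
        hFlush toGit
        header <- hGetLine fromGit
        case words header of
            [_sha, _objtype, size] | all isDigit size -> do
                content <- B.hGet fromGit (read size) -- read the stated size
                _ <- hGetLine fromGit                 -- eat the trailing newline
                return (Just content)
            _ -> return Nothing -- "<object> missing", or some other error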
All commands that often have to read a lot of information from
the git-annex branch should now be nearly as fast as before
the branch was introduced.
Before, fsck was taking approximately 3 hours; now it runs in 8 minutes.
The code is very nasty. It should be rewritten to read the header line
from git cat-file, and then read the specified number of bytes of content.
Only "partially" because the journal is not locked during the merge, so
there's a small window where a different git-annex process could write info
to the journal that overwrites info taken from the merge.
That could be dealt with by locking, but the lock would really need to be
around the whole git-annex process, to only let one run at a time. Otherwise, even
with the journal locked during the merge, another git-annex could already
be running, generate an overwriting change, and only store it in the journal
after the merge was complete. And similarly, two git-annex processes could
fight and overwrite each other's information independent of any merging.
So, a toplevel lock for git-annex may get added; it's something I've
considered before, as these potential, unlikely problems are not new.
(OTOH, fsck will deal with such problems.)
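If such a toplevel lock does get added, it could be as small as this sketch
(withToplevelLock is a hypothetical name; it uses the unix package's fcntl
locking):

    import Control.Exception (bracket)
    import System.IO (SeekMode(AbsoluteSeek))
    import System.Posix.IO

    -- Block until an exclusive lock on the lock file is held, run the
    -- action, then release the lock by closing the fd.
    withToplevelLock :: FilePath -> IO a -> IO a
    withToplevelLock lockfile action = bracket setup closeFd (const action)
      where
        setup = do
            fd <- createFile lockfile 0o644
            waitToSetLock fd (WriteLock, AbsoluteSeek, 0, 0)
            return fd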
git is slow when the index file is large and has to be rewritten each time
a file is changed. To speed this up, added a journal where changes are
recorded before being fed into the index file and committed to the
git-annex branch. The entire journal can be fed into git with just 2
commands, and only one write of the index file.
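A minimal sketch of the idea (not the actual code; one plausible pair of
commands is git hash-object --stdin-paths followed by
git update-index --index-info):

    import System.Process (readProcess)

    -- Stage all journalled files with two git commands and a single write
    -- of the index file. Paths are assumed to contain no tabs or newlines.
    stageJournal :: [(FilePath, FilePath)] -- (journal file, path in branch)
                 -> IO ()
    stageJournal files = do
        -- first command: hash every journal file; one sha1 per output line
        out <- readProcess "git" ["hash-object", "-w", "--stdin-paths"]
            (unlines (map fst files))
        -- second command: stage all the blobs in one index write
        let index = unlines
                [ "100644 blob " ++ sha ++ "\t" ++ dest
                | (sha, (_, dest)) <- zip (lines out) files ]
        _ <- readProcess "git" ["update-index", "--index-info"] index
        return ()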
That sucking sound is a whole page of code vanishing to be replaced with
return . catMaybes . map (logFileKey . takeFileName) =<< Branch.files
What can I say, git is my database, and haskell my copilot.
This fixes precommit, since in that hook, git sets the env var to write
to the lock file, which avoids git add failing due to the presence of the
lock file. (Took me a good hour and a half of confusion to figure this out.)
Test suite now passes 100%! Only the upgrade code still remains to be
written.
This will speed up typical cases like git-annex get, which currently
has to read the location log once, then read it a second time in order to
add a line to it. Since these reads now involve more than just reading
in a file, it seemed good to add a cache layer.
Only the most recent thing needs to be cached, because git-annex has
good locality; it operates on one file at a time, and only cares
about one item from the branch per file.
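Since only one item is ever cached, the cache can be as simple as this
sketch (illustrative names, not the actual API):

    import Data.IORef

    type Cache = IORef (Maybe (FilePath, String))

    newCache :: IO Cache
    newCache = newIORef Nothing

    -- Return the cached content if it is for this file; otherwise read it,
    -- remember it, and return it.
    getCached :: Cache -> FilePath -> IO String -> IO String
    getCached cache file reader = do
        c <- readIORef cache
        case c of
            Just (f, content) | f == file -> return content
            _ -> do
                content <- reader
                writeIORef cache (Just (file, content))
                return content

    -- Writes also update the cache, so the read that follows them is free.
    setCached :: Cache -> FilePath -> String -> IO ()
    setCached cache file content = writeIORef cache (Just (file, content))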
This is needed for robust handling of the git-annex branch. Since changes
are staged to its index as git-annex runs, and committed at the end,
it's possible that git-annex is interrupted, and leaves a dirty index.
When it next runs, it needs to be able to merge the git-annex branch
as necessary, without losing the existing changes in the index.
Note that this assumes that the git-annex branch is only modified by
git-annex. Any changes to it will be lost when git-annex updates the
branch. I don't see a good, inexpensive way to find changes in
the git-annex branch that aren't in the index, and union merging the
git-annex branch into the index every time would likewise be expensive.
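For reference, a union merge into the index, without touching the working
tree, has roughly this shape (a simplified sketch, not the real merge code):

    import System.Process (readProcess, readProcessWithExitCode)
    import Data.List (nub)

    -- For every path that differs between the index and the ref, concatenate
    -- the unique lines of both versions, write the result as a blob, and
    -- stage it. A blob missing on one side just contributes no lines.
    unionMergeIntoIndex :: String -> IO ()
    unionMergeIntoIndex ref = do
        changed <- readProcess "git"
            ["diff-index", "--cached", "--name-only", ref] ""
        mapM_ mergeFile (lines changed)
      where
        mergeFile path = do
            old <- catBlob (":" ++ path)          -- version in the index
            new <- catBlob (ref ++ ":" ++ path)   -- version on the ref
            let merged = unlines . nub $ lines old ++ lines new
            sha <- readProcess "git" ["hash-object", "-w", "--stdin"] merged
            _ <- readProcess "git" ["update-index", "--index-info"]
                ("100644 blob " ++ filter (/= '\n') sha ++ "\t" ++ path ++ "\n")
            return ()
        catBlob obj = do
            (_, out, _) <- readProcessWithExitCode "git"
                ["cat-file", "blob", obj] ""
            return out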
There is no suitable git hook to run code when pulling changes that
might need to be merged into the git-annex branch. The post-merge hook
is only run when changes are merged into HEAD, and it's possible,
and indeed likely, that many pulls will only have changes on the git-annex
branch, but not in HEAD, and so will not trigger it.
So, git-annex will have to take care to update the branch before reading
from it, to make sure it has merged in current info from remotes. Happily,
this can be done quite inexpensively, just a git-show-ref to list
branches, and a minimal git-log to see if there are unmerged changes
on the branches. To speed things up further, this will be done at most once
per git-annex run.
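The check can look something like this (a sketch with illustrative names;
it assumes the local git-annex branch exists, since git show-ref exits
nonzero when nothing matches):

    import Control.Monad (filterM)
    import System.Process (readProcess)

    -- git show-ref lists every ref named git-annex (local and
    -- remote-tracking); a minimal git log per ref shows whether it has any
    -- commit not already in the local git-annex branch.
    refsToMerge :: IO [String]
    refsToMerge = do
        out <- readProcess "git" ["show-ref", "git-annex"] ""
        let refs = [ ref | [_sha, ref] <- map words (lines out)
                         , ref /= "refs/heads/git-annex" ]
        filterM unmerged refs
      where
        unmerged ref = do
            -- any output at all means the ref has unmerged commits
            l <- readProcess "git"
                ["log", "refs/heads/git-annex.." ++ ref, "--oneline", "-n1"] ""
            return (not (null l))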