This cache prevented noticing changes made by another process.
The case I just ran into involved the assistant dropping a file, which
cached its presence info. Then the same file was downloaded again,
but the assistant didn't know its presence info had changed.
I don't see a way to keep this cache. Will instead rely on the OS level
file cache, for files in the journal. May need to add more higher-level
caching of info that it's ok to have a potentially stale copy of,
although much of git-annex already does so.
Setting GIT_INDEX_FILE clobbers the rest of the environment, making git
not read ~/.gitconfig, and blow up if GECOS didn't have a name for the
user.
I'm not entirely happy with getEnvironment being run every time now,
that's somewhat expensive. It may make sense to just set GIT_COMMITTER_*
and GIT_AUTHOR_*, but I worry that clobbering the rest could break PATH,
or GIT_PATH, or something else that might be used by a command run in here.
And caching the environment is not a good idea either; it can change..
I'm down to 9 places in the code that can produce unwaited for zombies.
Most of these are pretty innocuous, at least for now, are only
used in short-running commands, or commands that run a set of
actions and explicitly reap zombies after each one.
The one from Annex.Branch.files could be trouble later,
since both Command.Fsck and Command.Unused can trigger it,
and the assistant will be doing those eventally. Ditto the one in
Git.LsTree.lsTree, which Command.Unused uses.
The only ones currently affecting the assistant though, are
in Git.LsFiles. Several threads use several of those.
(And yeah, using pipes or ResourceT would be a less ad-hoc approach,
but I don't really feel like ripping my entire code base apart right
now to change a foundation monad. Maybe one of these days..)
Branch.get is not able to see changes that have been staged to the index
but not committed. This is a limitation of git cat-file --batch; when
reading from the index, as opposed to from a branch, it does not notice
changes made after the first time it reads the index.
So, had to revert the changes made in 1f73db3469
to make annex.alwayscommit=false stage changes.
Also, ensure that Branch.change and Branch.get always see changes
at all points during a commit, by not deleting journal files when
staging to the index. Delete them only after committing the branch.
Before, there was a race during commits where a different git-annex
could see out-of-date info from the branch while a commit was in progress.
That's also done when updating the branch to merge in remote branches.
In the case where the local git-annex branch has had changes pushed into it
that are not yet reflected in the index, and there are journalled changes
as well, a merge commit has to be done.
Commits used to be made to the git-annex branch whenever there were
journalled changes from a previous command, and the current command looked
up the value of a file. This no longer happens.
This means that transferkey, which is a oneshot command that stages
changes, can be run multiple times by the assistant, without each of them
committing the changes made by the command before. Which will be a lot
faster and use less space by batching up the commits.
Commits still happen if a remote git-annex branch has been changed and is
merged in.
Found a very cheap way to determine when a disconnected remote has
diverged, and has new content that needs to be transferred: Piggyback on
the git-annex branch update, which already checks for divergence.
However, this does not check if new content has appeared locally while
disconnected, that should be transferred to the remote.
Also, this does not handle cases where the two git repos are in sync,
but their content syncing has not caught up yet.
This code could have its efficiency improved:
* When multiple remotes are synced, if any one has diverged, they're
all queued for transfer scans.
* The transfer scanner could be told whether the remote has new content,
the local repo has new content, or both, and could optimise its scan
accordingly.
There's a race adding a new file to the annex: The file is moved to the
annex and replaced with a symlink, and then we git add the symlink. If
someone comes along in the meantime and replaces the symlink with
something else, such as a new large file, we add that instead. Which could
be bad..
This race is fixed by avoiding using git add, instead the symlink is
directly staged into the index.
It would be nice to make `git annex add` use this same technique.
I have not done so yet because it currently runs git update-index once per
file, which would slow does `git annex add`. A future enhancement would be
to extend the Git.Queue to include the ability to run update-index with
a list of Streamers.
A bit tricky to avoid printing it twice in a row when there are queued git
commands to run and journal to stage.
Added a generic way to run an action that may output multiple side
messages, with only the first displayed.
Now changes are staged into the branch's index, but not committed,
which avoids growing a large journal. And sync and merge always
explicitly commit, ensuring that even when they do nothing else,
they commit the staged changes.
Added a flag file to indicate that the branch's journal contains
uncommitted changes. (Could use git ls-files, but don't want to run
that every time.)
In the future, this ability to have uncommitted changes staged in the
journal might be used on remotes after a series of oneshot commands.
The list of files had to be retained until the end so it could be deleted.
Also, a list of update-index lines was generated and only then fed into it.
Now everything streams in constant space.
When hashing the files, the entire list of shas was read strictly.
That was entirely unnecessary, since there's a cleanup action run
after they're consumed.
This needs to run git log on the location log files to get at all past
versions of the file, which tends to be a bit slow.
It would be possible to make a version optimised for showing the location
logs for every key. That would only need to run git log once, so would be
faster, but it would need to process an enormous amount of data, so
would not speed up the individual file case.
In the future it would be nice to support log --format. log --json also
doesn't work right yet.
Could have just used hGetContentsStrict here, but that would require
storing all the shas in memory. Since this is called at the end of a
git-annex run, it may have created a *lot* of shas, so I avoid that memory
use and stream them out like before.
Dealing with a race without using locking is exceedingly difficult and tricky.
Fully tested, I hope.
There are three places left where the branch can be updated, that are not
covered by the race recovery code. Let's prove they're all immune to the
race:
1. tryFastForwardTo checks to see if a fast-forward can be done,
and then does git-update-ref on the branch to fast-forward it.
If a push comes in before the check, then either no fast-forward
will be done (ok), or the push set the branch to a ref that can
still be fast-forwarded (also ok)
If a push comes in after the check, the git-update-ref will
undo the ref change made by the push. It's as if the push did not come
in, and the next git-push will see this, and try to re-do it.
(acceptable)
2. When creating the branch for the very first time, an empty index
is created, and a commit of it made to the branch. The commit's ref
is recorded as the current state of the index. If a push came in
during that, it will be noticed the next time a commit is made to the
branch, since the branch will have changed. (ok)
3. Creating the branch from an existing remote branch involves making
the branch, and then getting its ref, and recording that the index
reflects that ref.
If a push creates the branch first, git-branch will fail (ok).
If the branch is created and a racing push is then able to change it
(highly unlikely!) we're still ok, because it first records the ref into
the index.lck, and then updating the index. The race can cause the
index.lck to have the old branch ref, while the index has the newly pushed
branch merged into it, but that only results in an unnecessary update of
the index file later on.
The last branch ref that the index was updated to is stored in
.git/annex/index.lck, and the index only updated when the current
branch ref differs.
(The .lck file should later be used for locking too.)
Some more optimization is still needed, since there is some redundancy in
calls to git show-ref.
Always merge the git-annex branch into .git/annex/index before making a
commit from the index.
This ensures that, when the branch has been changed in any way
(by a push being received, or changes pulled directly into it, or
even by the user checking it out, and committing a change), the index
reflects those changes.
This is much too slow; it needs to be optimised to only update the
index when the branch has really changed, not every time.
Also, there is an unhandled race, when a change is made to the branch
right after the index gets updated. I left it in for now because it's
unlikely and I didn't want to complicate things with additional locking
yet.
The only fully supported thing is to have the main repository on one disk,
and .git/annex on another. Only commands that move data in/out of the annex
will need to copy it across devices.
There is only partial support for putting arbitrary subdirectories of
.git/annex on different devices. For one thing, but this can require more
copies to be done. For example, when .git/annex/tmp is on one device, and
.git/annex/journal on another, every journal write involves a call to
mv(1). Also, there are a few places that make hard links between various
subdirectories of .git/annex with createLink, that are not handled.
In the common case without cross-device, the new moveFile is actually
faster than renameFile, avoiding an unncessary stat to check that a file
(not a directory) is being moved. Of course if a cross-device move is
needed, it is as slow as mv(1) of the data.
The branch may not exist, if .git/annex has been copied over from another
repo (or a corrupted repo). I suppose it could also have gotten deleted
somehow. Without this, there is a confusing failure.
In git, a Ref can be a Sha, or a Branch, or a Tag. I added type aliases for
those. Note that this does not prevent mixing up of eg, refs and branches
at the type level. Since git really doesn't care, except rare cases like
git update-ref, or git tag -d, that seems ok for now.
There's also a tree-ish, but let's just use Ref for it. A given Sha or Ref
may or may not be a tree-ish, depending on the object type, so there seems
no point in trying to represent it at the type level.
Before, a merge was first calculated, by running various actions that
called git and built up a list of lines, which were at the end sent
to git update-index. This necessarily used space proportional to the size
of the diff between the trees being merged.
Now, lines are streamed into git update-index from each of the actions in
turn.
Runtime size of git-annex merge when merging 50000 location log files
drops from around 100 mb to a constant 4 mb.
Presumably it runs quite a lot faster, too.
Avoids doing auto-merging in commands that don't need fully current
information from the git-annex branch. In particular, git annex add no
longer needs to auto-merge. Affected commands: Anything that doesn't
look up data from the branch, but does write a change to it.
It might seem counterintuitive that we can change a value without first
making sure we have the current value. This optimisation works because
these two sequences are equivilant:
1. pull from remote
2. union merge
3. read file from branch
4. modify file and write to branch
vs.
1. read file from branch
2. modify file and write to branch
3. pull from remote
4. union merge
After either sequence, the git-annex branch contains the same logical content
for the modified file. (Possibly with lines in a different order or
additional old lines of course).
Many functions took the repo as their first parameter. Changing it
consistently to be the last parameter allows doing some useful things with
currying, that reduce boilerplate.
In particular, g <- gitRepo is almost never needed now, instead
use inRepo to run an IO action in the repo, and fromRepo to get
a value from the repo.
This also provides more opportunities to use monadic and applicative
combinators.
Thanks Valentin Haenel for a test case showing how non-fast-forward merges
could result in an ongoing pull/merge/push cycle.
While the git-annex branch is fast-forwarded, git-annex's index file is still
updated using the union merge strategy as before. There's no other way to
update the index that would be any faster.
It is possible that a union merge and a fast-forward result in different file
contents: Files should have the same lines, but a union merge may change
their order. If this happens, the next commit made to the git-annex branch
will have some unnecessary changes to line orders, but the consistency
of data should be preserved.
Note that when the journal contains changes, a fast-forward is never attempted,
which is fine, because committing those changes would be vanishingly unlikely
to leave the git-annex branch at a commit that already exists in one of
the remotes.
The real difficulty is handling the case where multiple remotes have all
changed. git-annex does find the best (ie, newest) one and fast forwards
to it. If the remotes are diverged, no fast-forward is done at all. It would
be possible to pick one, fast forward to it, and make a merge commit to
the rest, I see no benefit to adding that complexity.
Determining the best of N changed remotes requires N*2+1 calls to git-log, but
these are fast git-log calls, and N is typically small. Also, typically
some or all of the remote refs will be the same, and git-log is not called to
compare those. In the real world I expect this will almost always add only
1 git-log call to the merge process. (Which already makes N anyway.)
Specifically, disabled trying to update the git-annex branch on the remote,
since that data is never used by operations that act on such remotes.
Also, when copying content to such a remote, skip committing the presence
information changes to its git-annex branch. Leaving it in the journal there
is ok: Any command run on the remote that needs the info will flush the
journal.
This may partially solve this bug:
http://git-annex.branchable.com/bugs/fails_to_handle_lot_of_files/
Although I still see unreaped git processes piling up when doing a copy --to.
Another process may stage journalled files before the lock is
taken, so need to get the list of journalled files afterwards.
It's unfortunate this means getting the directory contents twice,
but it seems better to do that than sometimes take the lock
unnecessarily.