.. Allowing it to be used by things in constant space!
Random statistics: git annex status has gone from taking 239 mb
of memory and 26 seconds in a repo, to 8 mb and 13 seconds.
The trick here is the unsafeInterleaveIO, and the form of the function's
recursion, which I cribbed heavily from System.IO.HVFS.Utils.recurseDirStat.
The difference is, this one goes to a limited depth and avoids statting
everything.
Before, it leaked space due to caching lists of keys. Now all necessary
data about keys is calculated as they stream in.
The "nearly constant" is due to getKeysPresent, which builds up a lot
of [] thunks as it traverses .git/annex/objects/. Will deal with it later.
Now changes are staged into the branch's index, but not committed,
which avoids growing a large journal. And sync and merge always
explicitly commit, ensuring that even when they do nothing else,
they commit the staged changes.
Added a flag file to indicate that the branch's journal contains
uncommitted changes. (Could use git ls-files, but don't want to run
that every time.)
In the future, this ability to have uncommitted changes staged in the
journal might be used on remotes after a series of oneshot commands.
To avoid commits of data to the git-annex branch after each command
is run, set annex.alwayscommit=false. Its data will then be committed
less frequently, when a merge or sync is done.
I was able to reproduce this on linux using the kernel's nfs server and
mounting localhost:/. Determined that removing the directory fails when
the just-deleted file in it was locked. Considered dropping the lock
before removing the directory, but this would complicate parts of the code
that should not need to worry about locking. So instead, ignore the failure
to remove the directory in this case.
While I was at it, made it attempt to remove both levels of hash
directories, in case they're empty.
useful when adding hundreds of thousands of files on a system with plenty
of memory.
git add gets quite slow in such a large repository, so if the system has
more than the ~32 mb of memory the queue can use by default, it's a useful
optimisation to increase the queue size, in order to decrease the number
of times git add is run.
The list of files had to be retained until the end so it could be deleted.
Also, a list of update-index lines was generated and only then fed into it.
Now everything streams in constant space.
When hashing the files, the entire list of shas was read strictly.
That was entirely unnecessary, since there's a cleanup action run
after they're consumed.
Now gitattributes are looked up, efficiently, in only the places that
really need them, using the same approach used for cat-file.
The old CheckAttr code seemed very fragile, in the way it streamed files
through git check-attr.
I actually found that cad8824852
was still deadlocking with ghc 7.4, at the end of adding a lot of files.
This should fix that problem, and avoid future ones.
The best part is that this removes withAttrFilesInGit and withNumCopies,
which were complicated Seek methods, as well as simplfying the types
for several other Seek methods that had a Backend tupled in.
Done by adding a oneshot mode, in which location log changes are written to
the journal, but not committed. Taking advantage of git-annex's existing
ability to recover in this situation.
This is used by git-annex-shell and other places where changes are made to
a remote's location log.
Ssh connection caching is now enabled automatically by git-annex. Only one
ssh connection is made to each host per git-annex run, which can speed some
things up a lot, as well as avoiding repeated password prompts. Concurrent
git-annex processes also share ssh connections. Cached ssh connections are
shut down when git-annex exits.
Note: The rsync special remote does not yet participate in the ssh
connection caching.
For a local git remote, can symlink the file.
For a git remote using rsync, can preseed any local content.
There are a few reasons to use fsck --from on a normal git remote.
One is if it's using gitosis or similar, and you don't have shell access
to run git annex locally. Another reason could be if you just want to
fsck certian files of a bare remote.
This way, the build log will indicate whether StatFS can be relied on.
I've tested all the failing architectures now, and on all of them,
the StatFS code now returns Nothing, rather than Just nonsense.
Also, if annex.diskreserve is set on a platform where StatFS is not
working, git-annex will complain.
Also, the Makefile was missing the sources target used when building with
cabal.
This needs to run git log on the location log files to get at all past
versions of the file, which tends to be a bit slow.
It would be possible to make a version optimised for showing the location
logs for every key. That would only need to run git log once, so would be
faster, but it would need to process an enormous amount of data, so
would not speed up the individual file case.
In the future it would be nice to support log --format. log --json also
doesn't work right yet.
Could have just used hGetContentsStrict here, but that would require
storing all the shas in memory. Since this is called at the end of a
git-annex run, it may have created a *lot* of shas, so I avoid that memory
use and stream them out like before.
Dealing with a race without using locking is exceedingly difficult and tricky.
Fully tested, I hope.
There are three places left where the branch can be updated, that are not
covered by the race recovery code. Let's prove they're all immune to the
race:
1. tryFastForwardTo checks to see if a fast-forward can be done,
and then does git-update-ref on the branch to fast-forward it.
If a push comes in before the check, then either no fast-forward
will be done (ok), or the push set the branch to a ref that can
still be fast-forwarded (also ok)
If a push comes in after the check, the git-update-ref will
undo the ref change made by the push. It's as if the push did not come
in, and the next git-push will see this, and try to re-do it.
(acceptable)
2. When creating the branch for the very first time, an empty index
is created, and a commit of it made to the branch. The commit's ref
is recorded as the current state of the index. If a push came in
during that, it will be noticed the next time a commit is made to the
branch, since the branch will have changed. (ok)
3. Creating the branch from an existing remote branch involves making
the branch, and then getting its ref, and recording that the index
reflects that ref.
If a push creates the branch first, git-branch will fail (ok).
If the branch is created and a racing push is then able to change it
(highly unlikely!) we're still ok, because it first records the ref into
the index.lck, and then updating the index. The race can cause the
index.lck to have the old branch ref, while the index has the newly pushed
branch merged into it, but that only results in an unnecessary update of
the index file later on.