Avoided the slow git add; content is now injected directly into git and
the index is populated all in one pass. Now this runs on my large real-world
repo in 10 seconds, which is acceptable.
Also lots of code cleanups.
This is a new git subcommand that does a generic union merge operation
between two refs, storing the result in a branch. It operates efficiently
without touching the working tree. It does need to write out a temporary
index file, and may need to write out some other temp files as well.
This could be useful for anything that stores data in a branch,
and needs to merge changes into that branch without actually checking the
branch out. Since conflict handling can't be done without a working copy,
the merge type is always a union merge, which is fine for data stored in
log format (as git-annex does), or in non-conflicting files
(as pristine-tar does).
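
As a rough illustration of why a union merge suits log-format data, here is
a minimal sketch of the per-file idea: keep every line that appears in either
version and drop exact duplicates. The name unionMergeLines is hypothetical,
and this is not how the subcommand itself is implemented (it works on git
objects and an index, not on strings).

```haskell
import Data.List (nub)

-- | Union merge of two versions of a line-oriented file (hypothetical
-- helper). Every line present in either version is kept, in first-seen
-- order, and exact duplicate lines are dropped.
unionMergeLines :: String -> String -> String
unionMergeLines a b = unlines $ nub $ lines a ++ lines b
```

Since no line is ever discarded, neither side's log entries can be lost, which
is why no conflict resolution (and so no working copy) is needed.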
This probably belongs in git proper, but it will live in git-annex for now.
---
Plan is to move .git-annex/ to a git-annex branch, and use git-union-merge
to handle merging changes when pulling from remotes.
Some preliminary benchmarking using real .git-annex/ data indicates
that it's quite fast, except for the "git add" call, which is as slow
as "git add" tends to be with a big index.
These are defined in ifelse, but it's not currently available and I don't
want to pull in a library for 6 lines of code anyhow.
Also, ifelse sets the fixity to 1, which does not allow >>? error $ ...
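
For reference, a sketch of what such local definitions could look like. Only
>>? is named above; whenM, unlessM and >>! are assumed names, and the fixity
of 0 is an assumption chosen to match ($).

```haskell
import Control.Monad (unless, when)

-- | Run the action only if the monadic condition yields True.
whenM :: Monad m => m Bool -> m () -> m ()
whenM c a = c >>= \b -> when b a

-- | Run the action only if the monadic condition yields False.
unlessM :: Monad m => m Bool -> m () -> m ()
unlessM c a = c >>= \b -> unless b a

-- Operator forms of the two helpers above.
(>>?) :: Monad m => m Bool -> m () -> m ()
(>>?) = whenM

(>>!) :: Monad m => m Bool -> m () -> m ()
(>>!) = unlessM

-- Same fixity as ($), so `cond >>? error $ "msg"` parses as
-- `cond >>? (error $ "msg")`. With a left-associative fixity of 1 it
-- would parse as `(cond >>? error) $ "msg"`, which does not type check.
infixr 0 >>?
infixr 0 >>!
```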
When it's stalled, there are 3 processes:
git annex
git ls-files
git check-attr
git-annex stalls trying to write to git check-attr, which stalls trying to
write to stdout (read by git-annex).
git ls-files does not seem to be involved directly; I've seen the stall when
it was still streaming out the file list, and after it had exited and
zombified.
The read and write are supposed to be handled by two different threads,
which pipeBoth forks off, thus avoiding deadlock. But it does deadlock.
(Certain signals unblock the deadlock for a while, then it stalls again.)
So, this is another case of WTF is the ghc IO manager doing today?
I avoid the issue by converting the writer to a separate process.
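
To make the intended two-thread design concrete, here is a minimal sketch of
it written against System.Process instead of System.Cmd.Utils. It is neither
pipeBoth's actual code nor the separate-process workaround; the name
pipeBothThreaded is hypothetical. One thread feeds the child's stdin while
the calling thread drains its stdout, so neither pipe can fill up and block
the other side.

```haskell
import Control.Concurrent (forkIO)
import Control.Exception (evaluate)
import System.IO (hClose, hGetContents, hPutStr)
import System.Process

-- | Feed input to a command and collect its stdout (hypothetical helper).
pipeBothThreaded :: FilePath -> [String] -> String -> IO String
pipeBothThreaded cmd params input = do
    (Just hin, Just hout, _, ph) <- createProcess (proc cmd params)
        { std_in = CreatePipe, std_out = CreatePipe }
    -- Writer thread: send all input, then close stdin so the child sees EOF.
    _ <- forkIO $ hPutStr hin input >> hClose hin
    -- Reader (this thread): force the whole lazy output before waiting.
    output <- hGetContents hout
    _ <- evaluate (length output)
    _ <- waitForProcess ph
    return output
```

If the child exits before reading all of its input, the writer thread will
get an exception; a real implementation would want to handle that.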
Possibly this was caused by some change in ghc 7 -- I'm offline and cannot
verify now, but I'm sure I used to be able to run git annex drop w/o it
hanging! And the code does not seem to have changed, except for commit
c1dc407941, which I tried reverting without
success. In fact, I reverted all the way back to 0.20110316 and still
saw the stall.
Update: Minimal test case:
    import System.Cmd.Utils

    main = do
        as <- checkAttr "blah" $ map show [1..100000]
        sequence $ map (putStrLn . show) as

    checkAttr attr files = do
        (_, s) <- pipeBoth "git" params $ unlines files
        return $ lines s
      where
        params = ["check-attr", attr, "--stdin"]

Bug filed on ghc in debian, #624389
This was a real PITA to fix, since location logs can be staged both in
the current repo and in local remotes' repos, in which case the cwd will
not be in the repo. And git add needs different params in each case when
absolute paths are not used.
In passing, git annex fsck now stages location log fixes.
The space leak was somehow caused by this line:
    absfiles <- mapM absPath files
I confess, I don't quite understand why this caused bad buffering,
but apparently the whole pipeline from git-ls-files backed up at that
point.
Happily, rewriting the code to only get the cwd once and use a pure
function to calculate absfiles clears it up, and should be a little more
efficient in syscalls too.
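
The shape of that rewrite can be sketched as follows. The names absPathFrom
and absPaths are hypothetical, and the real code may normalize paths
differently; the point is only that the cwd is fetched once and the per-file
work is pure, so the list can be mapped over lazily.

```haskell
import System.Directory (getCurrentDirectory)
import System.FilePath (normalise, (</>))

-- | Pure: make a path absolute relative to a given cwd (hypothetical
-- helper). (</>) returns its second argument unchanged when that is
-- already absolute.
absPathFrom :: FilePath -> FilePath -> FilePath
absPathFrom cwd file = normalise (cwd </> file)

-- | One getCurrentDirectory call, then a pure map over the files.
absPaths :: [FilePath] -> IO [FilePath]
absPaths files = do
    cwd <- getCurrentDirectory
    return $ map (absPathFrom cwd) files
```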
* Look for dir.git directories the same as git does.
* Support remote urls specified as relative paths.
* Support non-ssh remote paths that contain tilde expansions.
The added check for whether a repo is bare means its config needs to be read, but
in this case it cannot be. That means that a repo currently not available
is assumed to be non-bare.
This relies on git-annex's behavior of reading the config of local repos.
That allows repoIsLocalBare to examine the git config for core.bare.
Hopefully, gitAnnexLocation, gitAnnexDir, and gitAnnexObjectDir
are only used on local repos. But, I have not audited fully, since
they're probably not (see for example copyToRemote). And so,
the functions fall back to their old non-bare-aware behavior for
non-local repos.
I had not taken into account that the code was written to run git and leave
zombies, for performance/laziness reasons, when I wrote the test suite.
So rather than the typical 1 zombie process that git-annex develops, the test
suite developed dozens. This caused problems on systems with low process limits.
Added a reap function to GitRepo, that waits for any zombie child processes.
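
A reap function of this kind could look roughly like the sketch below, which
assumes the unix package's System.Posix.Process. The names reapZombies and
tryIO are hypothetical, and this is not necessarily what GitRepo does.

```haskell
import Control.Exception (IOException, try)
import System.Posix.Process (getAnyProcessStatus)

-- | try, specialised to IOException.
tryIO :: IO a -> IO (Either IOException a)
tryIO = try

-- | Collect the exit status of every child that has already exited,
-- without blocking on children that are still running.
reapZombies :: IO ()
reapZombies = do
    -- First argument False: do not block. Second True: also report
    -- stopped children.
    r <- tryIO (getAnyProcessStatus False True)
    case r of
        Left _ -> return ()             -- no child processes at all
        Right Nothing -> return ()      -- children exist, none has exited
        Right (Just _) -> reapZombies   -- reaped one, look for more
```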