All commands that often have to read a lot of information from
the git-annex branch should now be nearly as fast as before
the branch was introduced.
Before fsck was taking approximatly 3 hours, now it's running in 8 minutes.
The code is very nasty. It should be rewritten to read the header line
from git cat-file, and then read the specified number of bytes of content.
Since the logs have just been moved into the git-annex branch, don't need
to worry about backwards compatability with old versions of git-annex that
would fail to parse location logs with extra fields tacked on.
This is a new git subcommand, that does a generic union merge operation
between two refs, storing the result in a branch. It operates efficiently
without touching the working tree. It does need to write out a temporary
index file, and may need to write out some other temp files as well.
This could be useful for anything that stores data in a branch,
and needs to merge changes into that branch without actually checking the
branch out. Since conflict handling can't be done without a working copy,
the merge type is always a union merge, which is fine for data stored in
log format (as git-annex does), or in non-conflicting files
(as pristine-tar does).
This probably belongs in git proper, but it will live in git-annex for now.
---
Plan is to move .git-annex/ to a git-annex branch, and use git-union-merge
to handle merging changes when pulling from remotes.
Some preliminary benchmarking using real .git-annex/ data indicates
that it's quite fast, except for the "git add" call, which is as slow
as "git add" tends to be with a big index.
cp is still used when copying file from repos on the same filesystem, since
--reflink=auto can make it significantly faster on filesystems such as
btrfs.
Directory special remotes still use cp, not rsync. It's not clear what
tmp file should be used when rsyncing to such a remote.
get not honoring --from has surprised me a few times, so least surprise
suggests it should just behave like copy --from. This leaves the difference
between get and copy being that copy always requires the remote to copy
from, while get will decide whether to get a file from a key/value store or
a remote.
Avoid git reset here too, so I no longer need to care that it's much more
expensive than seems wise (but I asked the git list about that anyway).
It's not necessary to reset the staged file content from the index, as
the `git add` of the the symlink will replace it anyway.
`git commit` of unlocked files is still slow, since git still has to shove
their entire content into the index, only to have it be thrown away. So it's
still better to use `git annex add`
This takes advantage of the debug logging done by missingh, and I added
my own debug messages for executeFile calls. There are still some other
low-level ways git-annex runs stuff that are not shown by debugging,
but this gets most of it easily.
In particular, munge key filenames to comply with the IA's filename limits,
disable encryption, support their nonstandard way of creating buckets, and
allow x-amz-* headers to be specified in initremote to set item metadata.
Still TODO: initremote does not handle multiword metadata headers right.
Releasing before I have quite finished the code. Got a little caught
up in Anathem references. Time for a walk and then a tiny bit more coding
and possibly testing.
When it's stalled, there are 3 processes:
git annex
git ls-files
git check-attr
git-annex stalls trying to write to git check-attr, which stalls trying to
write to stdout (read by git-annex).
git ls-files does not seem to be involved directly; I've seen the stall when
it was still streaming out the file list, and after it had exited and
zombified.
The read and write are supposed to be handled by two different threads,
which pipeBoth forks off, thus avoiding deadlock. But it does deadlock.
(Certian signals unblock the deadlock for a while, then it stalls again.)
So, this is another case of WTF is the ghc IO manager doing today?
I avoid the issue by converting the writer to a separate process.
Possibly this was caused by some change in ghc 7 -- I'm offline and cannot
verify now, but I'm sure I used to be able to run git annex drop w/o it
hanging! And the code does not seem to have changed, except for commit
c1dc407941, which I tried reverting without
success. In fact, I reverted all the way back to 0.20110316 and still
saw the stall.
Update: Minimal test case:
import System.Cmd.Utils
main = do
as <- checkAttr "blah" $ map show [1..100000]
sequence $ map (putStrLn . show) as
checkAttr attr files = do
(_, s) <- pipeBoth "git" params $ unlines files
return $ lines s
where
params = ["check-attr", attr, "--stdin"]
Bug filed on ghc in debian, #624389
Fully tested and working, including resuming and encryption. (Though not
resuming when sending *with* encryption; gpg doesn't produce identical
output each time.)
Uses same layout as the directory special remote and the .git/annex/objects/
directory.
This was a real PITA to fix, since location logs can be staged in
both the current repo, as well as in local remote's repos, in
which case the cwd will not be in the repo. And git add needs different
params in both cases, when absolute paths are not used.
In passing, git annex fsck now stages location log fixes.
The test suite will not be run if it cannot be compiled.
It may be possible later to split off the quickcheck using tests into
a separate program and keep most of the tests using just hunit.
The remaining leaks are in hS3. The leak with encryption was worked around
by the use of the temp file. (And was probably originally caused by
gpgCipherHandle sparking a thread which kept a reference to the start
of the byte string.)
* Update Debian build dependencies for ghc 7.
* Debian package is now built with S3 support. Thanks Joachim Breitner for
making this possible, also thanks Greg Heartsfield for working to improve
the hS3 library for git-annex.
Also hid a conflicting new symbol from Control.Monad.State
This was a most surprising leak. It occurred in the process that is forked
off to feed data to gpg. That process was passed a lazy ByteString of
input, and ghc seemed to not GC the ByteString as it was lazily read
and consumed, so memory slowly leaked as the file was read and passed
through gpg to bup.
To fix it, I simply changed the feeder to take an IO action that returns
the lazy bytestring, and fed the result directly to hPut.
AFAICS, this should change nothing WRT buffering. But somehow it makes
ghc's GC do the right thing. Probably I triggered some weakness in ghc's
GC (version 6.12.1).
(Note that S3 still has this leak, and others too. Fixing it will involve
another dance with the type system.)
Update: One theory I have is that this has something to do with
the forking of the feeder process. Perhaps, when the ByteString
is produced before the fork, ghc decides it need to hold a pointer
to the start of it, for some reason -- maybe it doesn't realize that
it is only used in the forked process.
Stalls were caused by code that did approximatly:
content' <- liftIO $ withEncryptedContent cipher content return
store content'
The return evaluated without actually reading content from S3,
and so the cleanup code began waiting on gpg to exit before
gpg could send all its data.
Fixing it involved moving the `store` type action into the IO monad:
liftIO $ withEncryptedContent cipher content store
Which was a bit of a pain to do, thank you type system, but
avoids the problem as now the whole content is consumed, and
stored, before cleanup.
Since the queue is flushed in between subcommand actions being run,
there should be no issues with actions that expect to queue up some stuff
and have it run after they do other stuff. So I didn't have to audit for
such assumptions.
to avoid some issues with git on OSX with the mixed-case directories. No
migration is needed; the old mixed case hash directories are still read;
new information is written to the new directories.
So, it would be nicer to just use Cabal and take advantage
of its conditional compilation support. But, Cabal seems to
lack good support for a package with an internal library that is used by
multiple executables. It wants to build everything twice or more.
That's too slow for me.
Anyway, fairly soon, I expect to upgrade hS3 to a requirment, and I
can just revert this.