openSUSE patches rsync with a patch adding SIP protocol support.
https://gist.github.com/2026167
With this patch, running rsync with no hostname parameter is apparently
supposed to list SIP hosts on the network. Practically, it does nothing
and exits 0.
git-annex uses rsync in a very special way to allow git-annex-shell to be
run on the remote host, and so did not need to specify a hostname, or a
file to transfer as a rsync parameter. So it sent ":", a degenerate case of
"host:file".
But the patch cannot differentiate ":" with no host parameter
(a bug in the SIP patch surely).
Results were that getting files failed, as rsync seemed to succeed, but the
requested file failed to arrive. Also I think that sending files will
make git-annex think a file has been transferred to the remote when
really rsync does nothing.
The workaround for this buggy rsync patch is to use "dummy:" as the
hostname.
Add tuning, docs, etc.
Not sure if status is the right place to remote size.. perhaps unused
should report the size and also warn if it sees more keys than the bloom
filter allows?
Can't trust the key size to be accurate for tmp and bad keys, so check
actual file size. In the wild I saw the old code be wrong by a factor
of about 100!
If all tmp/bad keys are empty, they're not shown in status at all.
Showing 0 bytes and suggesting to clean it up seemed weird..
Stale and bad files are rare, so it's more efficient to use inAnnex to see
if they can be deleted, rather than keeping the list of all present keys
around for them.
.. Allowing it to be used by things in constant space!
Random statistics: git annex status has gone from taking 239 mb
of memory and 26 seconds in a repo, to 8 mb and 13 seconds.
The trick here is the unsafeInterleaveIO, and the form of the function's
recursion, which I cribbed heavily from System.IO.HVFS.Utils.recurseDirStat.
The difference is, this one goes to a limited depth and avoids statting
everything.
Before, it leaked space due to caching lists of keys. Now all necessary
data about keys is calculated as they stream in.
The "nearly constant" is due to getKeysPresent, which builds up a lot
of [] thunks as it traverses .git/annex/objects/. Will deal with it later.
Much of the memory bloat turned out to be due to getKeysReferenced
containing a mapM, which is strict and buffered the whole list
rather than streaming it.
The other half of the bloat was due to building a temporary Set
in order to call S.difference. While that is more cpu efficient,
I switched to successive S.delete, since with it, I can run a whole
git annex unused in less than 8 mb of memory.
The whole Set of keys with content available is still stored in memory,
so running unused in a repo with a whole lot of file content will still
use more memory. In a repo containing 6000 files, it needed 40 mb.
Note that the status command still uses the bloatful getKeysReferenced.
This has two benefits.
1. When a lot of refs are going to be received, get them via lower cost
connection when possible.
2. Allows ctrl-c of sync after the cheaper remotes have been pulled from
(or pushed to).
Rather than running make, which runs configure, let Setup.hs just include
the configure code. The standalone configure is retained for use by the
Makefile.
This may work better with cabal-dev, since it avoids the Makefile running
ghc, and lets cabal handle all the compiler running, with whatever
flags it uses to expose dependencies.
Fix Key directory hash calculation code to behave as it did before version
3.20120227 when a key contains non-ascii.
The hash directories for a given Key are based on its md5sum.
Prior to ghc 7.4, Keys contained raw, undecoded bytes, so the md5sum was
taken of each byte in turn. With the ghc 7.4 filename encoding change,
keys contains decoded unicode characters (possibly with surrigates for
undecodable bytes). This changes the result of the md5sum, since the md5sum
used is pure haskell and supports unicode. And that won't do, as git-annex
will start looking in a different hash directory for the content of a key.
The surrigates are particularly bad, since that's essentially a ghc
implementation detail, so could change again at any time. Also, changing
the locale changes how the bytes are decoded, which can also change
the md5sum.
Symptoms would include things like:
* git annex fsck would complain that no copies existed of a file,
despite its symlink pointing to the content that was locally present
* git annex fix would change the symlink to use the wrong hash
directory.
Only WORM backend is likely to have been affected, since only it tends
to include much filename data (SHA1E could in theory also be affected).
I have not tried to support the hash directories used by git-annex versions
3.20120227 to 3.20120308, so things added with those versions with WORM
will require manual fixups. Sorry for the inconvenience!