This is incomplete, it does not honor it yet for hash directories
and other annex bookkeeping files. Some of that is not needed for a bare
repo; some of it may be.
Not only an optimisation. fsck always tried to preventWrite to make sure
file modes are good, and in a shared repo, that will fail on directories
not owned by the current user. Although there may be other problems with
such a setup.
When preparing the debian stable backport, I am seeing a call to statvfs()
succeed, but also set errno to 2 (ENOENT). Not sure why this happens;
I am in a schroot when it does happen, or perhaps stable's libc is a little
broken and sets errno incorrectly. Anyway, it should be perfectly fine to
clear errno after the successful call, rather than before it.
Recursive inotify has beaten me before, with its bad design and races,
but not this time! (I think.) This is able to follow the strongest
filesystem traffic I can throw at it, and robustly notices every file
add and delete. Mostly that's down to Haskell having a quite nice threaded
inotify library (that does its own buffering). A key insight was realizing
that the inotify directory add race could be dealt with by scanning for
files inside newly added directories.
TODO: Add support for freebsd/osx kqueue; see
http://hackage.haskell.org/package/kqueue
Can a git-annex-monitor be far off?
Fix Key directory hash calculation code to behave as it did before version
3.20120227 when a key contains non-ascii.
The hash directories for a given Key are based on its md5sum.
Prior to ghc 7.4, Keys contained raw, undecoded bytes, so the md5sum was
taken of each byte in turn. With the ghc 7.4 filename encoding change,
keys contains decoded unicode characters (possibly with surrigates for
undecodable bytes). This changes the result of the md5sum, since the md5sum
used is pure haskell and supports unicode. And that won't do, as git-annex
will start looking in a different hash directory for the content of a key.
The surrigates are particularly bad, since that's essentially a ghc
implementation detail, so could change again at any time. Also, changing
the locale changes how the bytes are decoded, which can also change
the md5sum.
Symptoms would include things like:
* git annex fsck would complain that no copies existed of a file,
despite its symlink pointing to the content that was locally present
* git annex fix would change the symlink to use the wrong hash
directory.
Only WORM backend is likely to have been affected, since only it tends
to include much filename data (SHA1E could in theory also be affected).
I have not tried to support the hash directories used by git-annex versions
3.20120227 to 3.20120308, so things added with those versions with WORM
will require manual fixups. Sorry for the inconvenience!
This is a straight up pure-code stinker. The relative path calculation
looked for common subdirectories in the two paths, but failed to stop
after the paths diverged. When a later pair of subdirectories were the
same, the resulting relative path was wrong.
Added regression test for this.
The code explicitly switches from HEAD to GET for most redirects.
Possibly because someone misread a spec (which does require switching from
POST to GET for 303 redirects). Or possibly because the spec really is that
bad. Upstream bug: https://github.com/haskell/HTTP/issues/24
Since we absolutely don't want to download entire (large) files from
the web when checking that they exist with HEAD, I wrote my own redirect
follower, based closely on the one used by Network.Browser, but without
this misfeature.
Note that Network.Browser checks that the redirect url is a http url
and fails if not. I don't, because I want to not need to change this
code when it gets https support (related: I'm surprised to see it
doesn't support https yet..). The check does not seem security significant;
it doesn't support file:// urls for example. If a http url is redirected
to https, the Network.Browser will actually make a http connection again.
This could loop, but only up to 5 times.
If there's no Content-Length, or the key has no size, this check is not
done, but it should happen most of the time, and protect against web
content that has changed.
Under ghc 7.4, this seems to be able to handle all filename encodings
again. Including filename encodings that do not match the LANG setting.
I think this will not work with earlier versions of ghc, it uses some ghc
internals.
Turns out that ghc 7.4 has a special filesystem encoding that it uses when
reading/writing filenames (as FilePaths). This encoding is documented
to allow "arbitrary undecodable bytes to be round-tripped through it".
So, to get FilePaths from eg, git ls-files, set the Handle that is reading
from git to use this encoding. Then things basically just work.
However, I have not found a way to make Text read using this encoding.
Text really does assume unicode. So I had to switch back to using String
when reading/writing data to git. Which is a pity, because it's some
percent slower, but at least it works.
Note that stdout and stderr also have to be set to this encoding, or
printing out filenames that contain undecodable bytes causes a crash.
IMHO this is a misfeature in ghc, that the user can pass you a filename,
which you can readFile, etc, but that default, putStr of filename may
cause a crash!
Git.CheckAttr gave me special trouble, because the filenames I got back
from git, after feeding them in, had further encoding breakage.
Rather than try to deal with that, I just zip up the input filenames
with the attributes. Which must be returned in the same order queried
for this to work.
Also of note is an apparent GHC bug I worked around in Git.CheckAttr. It
used to forkProcess and feed git from the child process. Unfortunatly,
after this forkProcess, accessing the `files` variable from the parent
returns []. Not the value that was passed into the function. This screams
of a bad bug, that's clobbering a variable, but for now I just avoid
forkProcess there to work around it. That forkProcess was itself only added
because of a ghc bug, #624389. I've confirmed that the test case for that
bug doesn't reproduce it with ghc 7.4. So that's ok, except for the new ghc
bug I have not isolated and reported. Why does this simple bit of code
magnet the ghc bugs? :)
Also, the symlink touching code is currently broken, when used on utf-8
filenames in a non-utf-8 locale, or probably on any filename containing
undecodable bytes, and I temporarily commented it out.
I had not realized what a memory leak the lazy state monad could be,
although I have not seen much evidence of actual leaking in git-annex.
However, if running git-annex on a great many files, this could matter.
The additional Utility.State.changeState adds even more strictness,
avoiding a problem I saw in github-backup where repeatedly modifying
state built up a huge pile of thunks.
This drops the >>! and >>? with the nice low fixity. IfElse does have
undocumented >>=>>! and >>=>>? operators, but I deem that too fishy.
Anyway, using whenM and unlessM is easier; I sometimes mixed the operators
up.