Under ghc 7.4, this seems to be able to handle all filename encodings
again. Including filename encodings that do not match the LANG setting.
I think this will not work with earlier versions of ghc, it uses some ghc
internals.
Turns out that ghc 7.4 has a special filesystem encoding that it uses when
reading/writing filenames (as FilePaths). This encoding is documented
to allow "arbitrary undecodable bytes to be round-tripped through it".
So, to get FilePaths from eg, git ls-files, set the Handle that is reading
from git to use this encoding. Then things basically just work.
However, I have not found a way to make Text read using this encoding.
Text really does assume unicode. So I had to switch back to using String
when reading/writing data to git. Which is a pity, because it's some
percent slower, but at least it works.
Note that stdout and stderr also have to be set to this encoding, or
printing out filenames that contain undecodable bytes causes a crash.
IMHO this is a misfeature in ghc, that the user can pass you a filename,
which you can readFile, etc, but that default, putStr of filename may
cause a crash!
Git.CheckAttr gave me special trouble, because the filenames I got back
from git, after feeding them in, had further encoding breakage.
Rather than try to deal with that, I just zip up the input filenames
with the attributes. Which must be returned in the same order queried
for this to work.
Also of note is an apparent GHC bug I worked around in Git.CheckAttr. It
used to forkProcess and feed git from the child process. Unfortunatly,
after this forkProcess, accessing the `files` variable from the parent
returns []. Not the value that was passed into the function. This screams
of a bad bug, that's clobbering a variable, but for now I just avoid
forkProcess there to work around it. That forkProcess was itself only added
because of a ghc bug, #624389. I've confirmed that the test case for that
bug doesn't reproduce it with ghc 7.4. So that's ok, except for the new ghc
bug I have not isolated and reported. Why does this simple bit of code
magnet the ghc bugs? :)
Also, the symlink touching code is currently broken, when used on utf-8
filenames in a non-utf-8 locale, or probably on any filename containing
undecodable bytes, and I temporarily commented it out.
Supporting multiple directory hash types will allow converting to a
different one, without a flag day.
gitAnnexLocation now checks which of the possible locations have a file.
This means more statting of files. Several places currently use
gitAnnexLocation and immediately check if the returned file exists;
those need to be optimised.
The only fully supported thing is to have the main repository on one disk,
and .git/annex on another. Only commands that move data in/out of the annex
will need to copy it across devices.
There is only partial support for putting arbitrary subdirectories of
.git/annex on different devices. For one thing, but this can require more
copies to be done. For example, when .git/annex/tmp is on one device, and
.git/annex/journal on another, every journal write involves a call to
mv(1). Also, there are a few places that make hard links between various
subdirectories of .git/annex with createLink, that are not handled.
In the common case without cross-device, the new moveFile is actually
faster than renameFile, avoiding an unncessary stat to check that a file
(not a directory) is being moved. Of course if a cross-device move is
needed, it is as slow as mv(1) of the data.
Many functions took the repo as their first parameter. Changing it
consistently to be the last parameter allows doing some useful things with
currying, that reduce boilerplate.
In particular, g <- gitRepo is almost never needed now, instead
use inRepo to run an IO action in the repo, and fromRepo to get
a value from the repo.
This also provides more opportunities to use monadic and applicative
combinators.
This new approach allows filtering out checks from the default set that are
not appropriate for a command, rather than having to list every check
that is appropriate. It also reduces some boilerplate.
Haskell does not define Eq for functions, so I had to go a long way around
with each check having a unique id. Meh.
These were a mistake, they make the type signatures harder to read and
less flexible. The CommandSeek, CommandStart, CommandPerform, and
CommandCleanup types were a good idea, but composing them with the
parameters expected is going too far.
The tricky part about this is that to generate a key, the file must be
present already. Worked around by adding (back) an URL key type, which
is used for addurl --fast.
This was more complex than would be expected. unannex has to use git commit -a
since it's removing files from git; git commit filelist won't do.
Allow commands to be added to the Git queue that have no associated files,
and run such commands once.
The only remaining vestiage of backends is different types of keys. These
are still called "backends", mostly to avoid needing to change user interface
and configuration. But everything to do with storing keys in different
backends was gone; instead different types of remotes are used.
In the refactoring, lots of code was moved out of odd corners like
Backend.File, to closer to where it's used, like Command.Drop and
Command.Fsck. Quite a lot of dead code was removed. Several data structures
became simpler, which may result in better runtime efficiency. There should
be no user-visible changes.
Since the queue is flushed in between subcommand actions being run,
there should be no issues with actions that expect to queue up some stuff
and have it run after they do other stuff. So I didn't have to audit for
such assumptions.
When adding files to the annex, the symlinks pointing at the annexed
content are made to have the same mtime as the original file. While git
does not preserve that information, this allows a tool like metastore to be
used with annexed files.