git-annex/doc/bugs/problems_with_utf8_names.mdwn

This bug is reopened to track some new UTF-8 filename issues caused by GHC
7.4. In this version of GHC, git-annex's hack to support filenames in any
encoding no longer works. Even unicode filenames fail to work when
git-annex is built with 7.4. --[[Joey]]

The new GHC requires that a new data type, `RawFilePath`, be used if you
don't want to impose UTF-8 filenames on your users. I have a `newghc` branch
in git where I am trying to convert it to use `RawFilePath`. However, since
there is no way to cast a `FilePath` to a `RawFilePath` or back (because
the encoding of `RawFilePath` is not specified), this means changing
essentially all of git-annex. Even the filenames used for keys in
`.git/annex/objects` need to use the new data type. --[[Joey]]

> Actually it may not be that bad. A `RawFilePath` contains only bytes,
> so it can be cast to a string, containing encoded characters. That
> string can then be 1) output in binary mode or 2) manipulated
> in ways that do not add characters larger than 255, and cast back to
> a `RawFilePath`. While not type-safe, such casts should at least
> help during bootstrapping, and might allow for a quick fix that only
> changes to `RawFilePath` at the edges.
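
The cast described above can be sketched as follows. This is a minimal illustration, not code from git-annex: `RawPath`, `rawToString`, `stringToRaw`, and `example` are hypothetical names, and `RawPath` stands in for `RawFilePath`, which the unix package defines as a synonym for `ByteString`.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as B8
import Data.Char (ord)

-- Hypothetical stand-in for RawFilePath (a ByteString synonym in unix).
type RawPath = B.ByteString

-- Cast each byte to the Char with the same code point (0..255).
-- No decoding happens, so the String holds encoded bytes, not characters.
rawToString :: RawPath -> String
rawToString = B8.unpack

-- Cast back. B8.pack silently truncates characters above 255,
-- so such strings are rejected instead of being corrupted.
stringToRaw :: String -> Maybe RawPath
stringToRaw s
  | all ((<= 255) . ord) s = Just (B8.pack s)
  | otherwise              = Nothing

-- The UTF-8 bytes of "Umlaut Ü.txt", written out byte by byte.
example :: RawPath
example = B.pack
  [0x55, 0x6d, 0x6c, 0x61, 0x75, 0x74, 0x20, 0xc3, 0x9c, 0x2e, 0x74, 0x78, 0x74]
```

With these definitions, `stringToRaw (rawToString example)` gives back `Just example`, while a string containing a character above 255 yields `Nothing`. The check in `stringToRaw` is the part that the type system does not enforce.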

**As a stopgap workaround**, I have made a branch `unicode-only`. This
makes git-annex work with unicode filenames with ghc 7.4, but *only*
unicode filenames. If you have filenames with some other encoding, you're
out in the cold, and it will probably just crash with an error about wrong
encoding. --[[Joey]]

----

Old, now fixed bug report follows:

There are problems with displaying filenames in UTF8 encoding, as shown here:

    $ echo $LANG
    en_GB.UTF-8
    $ git init
    $ git annex init test
    [...]
    $ touch "Umlaut Ü.txt"
    $ git annex add Uml*
    add Umlaut Ã.txt ok
    (Recording state in git...)
    $ find -name U\* | hexdump -C
    00000000  2e 2f 55 6d 6c 61 75 74  20 c3 9c 2e 74 78 74 0a  |./Umlaut ...txt.|
    00000010
    $ git annex find | hexdump -C
    00000000  55 6d 6c 61 75 74 20 c3  83 c2 9c 2e 74 78 74 0a  |Umlaut .....txt.|
    00000010
    $

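It looks like the common latin1-to-UTF8 double encoding: the UTF-8 bytes `c3 9c` of `Ü` are each treated as a latin1 character and encoded again, giving `c3 83 c2 9c`. Functionality other than output seems not to be affected.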
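
The byte sequences in the two hexdumps can be reproduced with a small sketch. The `utf8` encoder below is hand-written for illustration, and `onDisk` and `doubleEncoded` are hypothetical names. The sketch assumes the mangling is exactly one extra round of UTF-8 encoding applied to bytes that were read as single characters.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Minimal UTF-8 encoder for a single code point (illustration only).
utf8 :: Char -> [Word8]
utf8 c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Leading bits of the code point, placed in the first byte.
    hi s = fromIntegral (n `shiftR` s)
    -- Continuation byte: 10xxxxxx carrying six bits of the code point.
    cont s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)

-- What the filesystem holds: the UTF-8 bytes of the name.
onDisk :: String -> [Word8]
onDisk = concatMap utf8

-- The suspected bug: each byte is taken as one Char (as latin1 would),
-- and the resulting String is UTF-8 encoded a second time on output.
doubleEncoded :: String -> [Word8]
doubleEncoded = concatMap utf8 . map (chr . fromIntegral) . onDisk
```

For `"\xDC"` (the `Ü` in the transcript), `onDisk` gives the bytes `c3 9c` and `doubleEncoded` gives `c3 83 c2 9c`, matching the `find` and `git annex find` output respectively.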

> Yes, I believe that git-annex is reading filename data from git
> as a stream of char8s, and not decoding unicode in it into logical
> characters.
> Haskell then, I guess, tries to unicode-encode it when it's output to
> the console.
> This only seems to matter WRT its output to the console; the data
> does not get mangled internally and so it accesses the right files
> under the hood.
>
> I am too new to Haskell to really have a handle on how to deal with
> unicode and other encoding issues in it. In general, there are three
> valid approaches: --[[Joey]]
>
> 1. Convert all input data to unicode and be unicode clean end-to-end
>    internally. Problematic here since filenames may not necessarily be
>    encoded in utf-8 (an archive could have historical filenames using
>    varying encodings), and you don't want which files are accessed to
>    depend on locale settings.
>    > I tried to do this by making parts of GitRepo call
>    > Codec.Binary.UTF8.String.decodeString when reading filenames from
>    > git. This seemed to break attempts to operate on the files;
>    > weirdly encoded strings were seen in syscalls in strace.
> 1. Keep input and internal data un-decoded, but decode it when
>    outputting a filename (assuming the filename is encoded using the
>    user's configured encoding), and allow haskell's output encoding to then
>    encode it according to the user's locale configuration.
>    > This is now implemented. I'm not very happy that I have to watch
>    > out for any place that a filename is output and call `filePathToString`
>    > on it, but there are really not too many such places in git-annex.
>    >
>    > Note that this apparently only affects filenames.
>    > (Names of files in the annex, and also some places where names
>    > of keys are displayed.) Utf-8 in the uuid.map file etc seems
>    > to be handled cleanly.
> 1. Avoid encodings entirely. Mostly what I'm doing now; probably
>    could find a way to disable encoding of console output. Then the raw
>    filename would be displayed, which should work ok. git-annex does
>    not really need to pull apart filenames; they are almost entirely
>    opaque blobs. I guess that the `--exclude` option is the exception
>    to that, but it is currently not unicode safe anyway. (Update: tried
>    `--exclude` again, seems it is unicode clean.)
>    One other possible issue would be that this could cause problems if
>    git-annex were translated.
>    > On second thought, I switched to this. Any decoding of a filename
>    > is going to make someone unhappy; the previous approach broke
>    > non-utf8 filenames.
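
Approach 3 can be sketched roughly as below, assuming the filename is held as a `String` with one `Char` per original byte. `rawBytes` and `putFileNameRaw` are hypothetical names, not git-annex functions. The sketch writes through `Data.ByteString.Char8` so that the handle's locale encoding is never applied to the name.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as B8
import System.IO (Handle, stdout)

-- Turn the un-decoded String back into the bytes it was read from.
-- Assumes every Char is in 0..255; B8.pack truncates anything larger.
rawBytes :: String -> B.ByteString
rawBytes = B8.pack

-- Write the filename exactly as stored, followed by a newline.
-- ByteString output bypasses the handle's text encoding.
putFileNameRaw :: Handle -> String -> IO ()
putFileNameRaw h = B8.hPutStrLn h . rawBytes

-- The name from the transcript, one Char per UTF-8 byte.
transcriptName :: String
transcriptName = "Umlaut \xC3\x9C.txt"
```

Calling `putFileNameRaw stdout transcriptName` emits the same bytes that `find` shows, so a UTF-8 terminal renders the name correctly and a name in any other encoding passes through untouched. The alternative of calling `hSetBinaryMode` on the handle would have a similar effect but changes the mode for everything written afterwards.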