git-annex/doc/bugs/problems_with_utf8_names.mdwn


This bug is reopened to track some new UTF-8 filename issues caused by GHC
7.4. In this version of GHC, git-annex's hack to support filenames in any
encoding no longer works. Even unicode filenames fail to work when
git-annex is built with 7.4. --[[Joey]]
This bug is now fixed in current master. Once again, git-annex will work
for all filename encodings, and all system encodings. It will
only build with the new GHC. [[done]] --[[Joey]]
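
The fix builds on the filesystem encoding that GHC 7.4's base library exposes, which roundtrips bytes that don't decode in the current locale as surrogate code points. A minimal sketch of that approach (not the actual git-annex code; it assumes `getFileSystemEncoding` from `GHC.IO.Encoding`, available in base 4.5 / GHC 7.4 and later):

```haskell
import GHC.IO.Encoding (getFileSystemEncoding)
import System.IO (hSetEncoding, stdout, stderr)

-- Sketch only: with GHC >= 7.4, FilePaths are decoded using the
-- filesystem encoding, which escapes undecodable bytes as surrogate
-- code points. Applying the same encoding to the output handles lets
-- those escaped bytes roundtrip back out unmangled, whatever the
-- filename's original encoding was.
main :: IO ()
main = do
  fsenc <- getFileSystemEncoding
  hSetEncoding stdout fsenc
  hSetEncoding stderr fsenc
  putStrLn "filenames printed after this point roundtrip byte-for-byte"
```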
----
Old, now fixed bug report follows:
There are problems with displaying filenames in UTF-8 encoding, as shown here:

    $ echo $LANG
    en_GB.UTF-8
    $ git init
    $ git annex init test
    [...]
    $ touch "Umlaut Ü.txt"
    $ git annex add Uml*
    add Umlaut Ã.txt ok
    (Recording state in git...)
    $ find -name U\* | hexdump -C
    00000000  2e 2f 55 6d 6c 61 75 74  20 c3 9c 2e 74 78 74 0a  |./Umlaut ...txt.|
    00000010
    $ git annex find | hexdump -C
    00000000  55 6d 6c 61 75 74 20 c3  83 c2 9c 2e 74 78 74 0a  |Umlaut .....txt.|
    00000010
    $
It looks like the common latin1-to-UTF-8 double encoding: the filename's UTF-8 bytes are re-encoded as if they were latin1. Functionality other than output seems unaffected.
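
The double encoding in the hexdump can be reproduced directly: Ü is the UTF-8 byte pair `c3 9c`, and UTF-8-encoding those two bytes again, as if each byte were a character, yields the `c3 83 c2 9c` shown above. A sketch with a hand-rolled encoder (the `utf8` helper is illustrative, not git-annex code):

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Illustrative UTF-8 encoder; only handles code points below 0x800,
-- which is enough for this demonstration.
utf8 :: Char -> [Word8]
utf8 c
  | n < 0x80  = [fromIntegral n]
  | otherwise = [ fromIntegral (0xc0 + n `div` 0x40)
                , fromIntegral (0x80 + n `mod` 0x40) ]
  where n = ord c

main :: IO ()
main = do
  -- "Ü" is the UTF-8 bytes c3 9c; treating each byte as a character
  -- and encoding again gives the c3 83 c2 9c seen in the hexdump.
  print (concatMap utf8 "\xc3\x9c")  -- [195,131,194,156]
```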
> Yes, I believe that git-annex is reading filename data from git
> as a stream of char8s, and not decoding unicode in it into logical
> characters.
> Haskell then, I guess, tries to unicode-encode it when it is output
> to the console.
> This only seems to matter WRT its output to the console; the data
> does not get mangled internally and so it accesses the right files
> under the hood.
>
> I am too new to haskell to really have a handle on how to handle
> unicode and other encoding issues with it. In general, there are three
> valid approaches: --[[Joey]]
>
> 1. Convert all input data to unicode and be unicode clean end-to-end
> internally. Problematic here since filenames may not necessarily be
> encoded in utf-8 (an archive could have historical filenames using
> varying encodings), and you don't want which files are accessed to
> depend on locale settings.
> > I tried to do this by making parts of GitRepo call
> > Codec.Binary.UTF8.String.decodeString when reading filenames from
> > git. This seemed to break attempts to operate on the files,
> > weirdly encoded strings were seen in syscalls in strace.
> 1. Keep input and internal data un-decoded, but decode it when
> outputting a filename (assuming the filename is encoded using the
> user's configured encoding), and allow haskell's output encoding to then
> encode it according to the user's locale configuration.
> > This is now implemented. I'm not very happy that I have to watch
> > out for any place that a filename is output and call `filePathToString`
> > on it, but there are really not too many such places in git-annex.
> >
> > Note that this only affects filenames apparently.
> > (Names of files in the annex, and also some places where names
> > of keys are displayed.) Utf-8 in the uuid.map file etc seems
> > to be handled cleanly.
> 1. Avoid encodings entirely. Mostly what I'm doing now; probably
> could find a way to disable encoding of console output. Then the raw
> filename would be displayed, which should work ok. git-annex does
> not really need to pull apart filenames; they are almost entirely
> opaque blobs. I guess that the `--exclude` option is the exception
> to that, but it is currently not unicode safe anyway. (Update: tried
> `--exclude` again, seems it is unicode clean..)
> One other possible
> issue would be that this could cause problems if git-annex were
> translated.
> > On second thought, I switched to this. Any decoding of a filename
> > is going to make someone unhappy; the previous approach broke
> > non-utf8 filenames.
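
The "avoid encodings entirely" approach above can be sketched by switching the output handle to a byte-transparent encoding, so raw filename bytes pass through untouched and the terminal decides how to display them. A sketch assuming `char8` from `GHC.IO.Encoding` (available in base 4.4 and later; `latin1` from `System.IO` behaves similarly for output):

```haskell
import GHC.IO.Encoding (char8)
import System.IO (hSetEncoding, stdout)

main :: IO ()
main = do
  -- char8 writes each Char's low byte directly, so a filename read in
  -- as raw char8 data goes back out byte-for-byte, with no attempt by
  -- the runtime to interpret or re-encode it.
  hSetEncoding stdout char8
  putStrLn "Umlaut \xc3\x9c.txt"
```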