Fix display of unicode filenames.

Internally, the filenames are stored as un-decoded unicode.
I tried decoding them, but then haskell tries to access the wrong files.
Hmm.

So, I've unhappily chosen option "B", which is to decode filenames before
they are displayed.
Joey Hess 2011-02-10 14:21:44 -04:00
parent e7a3475704
commit fe55b4644e
11 changed files with 63 additions and 21 deletions


@ -37,10 +37,22 @@ It looks like the common latin1-to-UTF8 encoding. Functionality other than otupu
> encoded in utf-8 (an archive could have historical filenames using
> varying encodings), and you don't want which files are accessed to
> depend on locale settings.
> > I tried to do this by making parts of GitRepo call
> > Codec.Binary.UTF8.String.decodeString when reading filenames from
> > git. This seemed to break attempts to operate on the files,
> > weirdly encoded strings were seen in syscalls in strace.
> 1. Keep input and internal data un-decoded, but decode it when
> outputting a filename (assuming the filename is encoded using the
> user's configured encoding), and allow haskell's output encoding to then
> encode it according to the user's locale configuration.
> > This is now [[implemented|done]]. I'm not very happy that I have to watch
> > out for any place that a filename is output and call `showFile`
> > on it, but there are really not too many such places in git-annex.
> >
> > Note that this apparently only affects filenames.
> > (Names of files in the annex, and also some places where names
> > of keys are displayed.) UTF-8 in the uuid.map file etc. seems
> > to be handled cleanly.
> 1. Avoid encodings entirely. Mostly what I'm doing now; probably
> could find a way to disable encoding of console output. Then the raw
> filename would be displayed, which should work ok. git-annex does
@ -50,13 +62,3 @@ It looks like the common latin1-to-UTF8 encoding. Functionality other than otupu
> One other possible
> issue would be that this could cause problems if git-annex were
> translated.
>
> BTW, for more fun, try unsetting LANG, and then you can see
> stuff like this:

    joey@gnu:~/tmp/aa>git annex add ./Üa
    add add add add git-annex: <stdout>: commitAndReleaseBuffer: invalid
    argument (Invalid or incomplete multibyte or wide character)

> (Add -q to work around this; once it doesn't need to print the filename,
> it can act on it ok!)


@ -0,0 +1,33 @@
Try unsetting LANG and passing git-annex unicode filenames.

    joey@gnu:~/tmp/aa>git annex add ./Üa
    add add add add git-annex: <stdout>: commitAndReleaseBuffer: invalid
    argument (Invalid or incomplete multibyte or wide character)

The same problem can be seen with a simple haskell program:

    import System.Environment
    import Codec.Binary.UTF8.String

    main = do
        args <- getArgs
        putStrLn $ decodeString $ args !! 0

    joey@gnu:~/src/git-annex>LANG= runghc ~/foo.hs Ü
    foo.hs: <stdout>: hPutChar: invalid argument (Invalid or incomplete multibyte or wide character)

(The call to `decodeString` is necessary to make the input
unicode string be displayed properly in a utf8 locale, but
does not contribute to this problem.)
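
I guess that haskell is setting the IO encoding to latin1, which
is [documented](http://haskell.org/ghc/docs/latest/html/libraries/base/System-IO.html#v:latin1)
to error out on characters > 255. (Ü is U+00DC, which is below 256, so
this guess may not be the whole story; a 7-bit ASCII encoding picked for
the unset locale could also reject it.)
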
So this program doesn't have the problem -- but may output garbage
on non-utf-8 capable terminals:
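
    import System.Environment
    import System.IO
    import Codec.Binary.UTF8.String

    main = do
        -- Force utf8 output regardless of the locale.
        hSetEncoding stdout utf8
        args <- getArgs
        putStrLn $ decodeString $ args !! 0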
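
A possible variation on forcing utf8, sketched here under assumptions and not tested across systems: keep whatever encoding stdout already has, but switch it to a lenient variant so that characters it cannot represent are replaced instead of raising the error. This relies on the `//TRANSLIT` suffix that `mkTextEncoding` is documented to accept in newer versions of base; `lenientName` is a hypothetical helper, and the `decodeString` call is left out so the example needs only base.

```haskell
import System.Environment (getArgs)
import System.IO

-- Hypothetical helper: name of a lenient variant of an encoding, in which
-- unencodable characters get replaced instead of causing an error. Any
-- suffix already present on the name is dropped first.
lenientName :: String -> String
lenientName name = takeWhile (/= '/') name ++ "//TRANSLIT"

main :: IO ()
main = do
    current <- hGetEncoding stdout
    case current of
        Nothing  -> return ()  -- binary mode: there is no encoding to adjust
        Just enc -> mkTextEncoding (lenientName (show enc)) >>= hSetEncoding stdout
    args <- getArgs
    -- Print each argument; characters stdout cannot encode are replaced.
    mapM_ putStrLn args
```

The trade-off is the reverse of the utf8 version: a terminal that cannot show the name gets replacement characters in place of garbage, but the name that is displayed is no longer exact.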