Fix display of unicode filenames.

Internally, filenames are stored as un-decoded unicode (the raw encoded bytes). I tried decoding them on input, but then Haskell tries to access the wrong files. So I've unhappily chosen option "B", which is to decode filenames only at the point where they are displayed.
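A minimal sketch of the display-time decoding idea, not the code from this commit. It assumes a filename is held as a `String` whose characters are raw UTF-8 byte values. `decodeUtf8Bytes` is a hypothetical stand-in for `Codec.Binary.UTF8.String.decodeString`, written against `base` only so the example is self-contained, and the `showFile` shown here only illustrates the role described in the bug notes; the real function may differ.

```haskell
import Data.Bits ((.&.), (.|.), shiftL)
import Data.Char (chr, ord)

-- Hypothetical stand-in for decodeString: turns a String of UTF-8
-- byte values (each Char in 0..255) into a String of code points.
-- Malformed sequences become U+FFFD. Overlong encodings are not
-- rejected, so this is illustrative rather than a strict decoder.
decodeUtf8Bytes :: String -> String
decodeUtf8Bytes [] = []
decodeUtf8Bytes (c:cs)
  | b < 0x80              = c : decodeUtf8Bytes cs
  | b >= 0xC0 && b < 0xE0 = multi 1 (b .&. 0x1F)
  | b >= 0xE0 && b < 0xF0 = multi 2 (b .&. 0x0F)
  | b >= 0xF0 && b < 0xF8 = multi 3 (b .&. 0x07)
  | otherwise             = replacement : decodeUtf8Bytes cs
  where
    b = ord c
    replacement = '\xFFFD'
    -- Consume n continuation bytes and combine them with the lead bits.
    multi n lead =
      let (conts, rest) = splitAt n cs
          v = foldl step lead conts
      in if length conts == n && all isCont conts && v <= 0x10FFFF
           then chr v : decodeUtf8Bytes rest
           else replacement : decodeUtf8Bytes cs
    step acc x = (acc `shiftL` 6) .|. (ord x .&. 0x3F)
    isCont x = ord x .&. 0xC0 == 0x80

-- The un-decoded name is what gets passed to file operations;
-- only the copy handed to output functions goes through showFile.
showFile :: FilePath -> String
showFile = decodeUtf8Bytes
```

With this split, a call such as `putStrLn ("add " ++ showFile file)` would print the decoded name, while the file operations still receive `file` unchanged, so the bytes seen by syscalls stay as they were read.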
2011-02-10 14:21:44 -04:00
commit fe55b4644e
parent e7a3475704
11 changed files with 63 additions and 21 deletions
--- a/doc/bugs/problems_with_utf8_names.mdwn
+++ b/doc/bugs/problems_with_utf8_names.mdwn
@@ -37,10 +37,22 @@ It looks like the common latin1-to-UTF8 encoding. Functionality other than otupu
 >    encoded in utf-8 (an archive could have historical filenames using
 >    varying encodings), and you don't want which files are accessed to
 >    depend on locale settings.
+>    > I tried to do this by making parts of GitRepo call
+>    > Codec.Binary.UTF8.String.decodeString when reading filenames from
+>    > git. This seemed to break attempts to operate on the files,
+>    > weirdly encoded strings were seen in syscalls in strace.
 > 1. Keep input and internal data un-decoded, but decode it when
 >    outputting a filename (assuming the filename is encoded using the
 >    user's configured encoding), and allow haskell's output encoding to then
 >    encode it according to the user's locale configuration.
+>    > This is now [[implemented|done]]. I'm not very happy that I have to watch
+>    > out for any place that a filename is output and call `showFile`
+>    > on it, but there are really not too many such places in git-annex.
+>    >
+>    > Note that this only affects filenames apparently. 
+>    > (Names of files in the annex, and also some places where names
+>    > of keys are displayed.) Utf-8 in the uuid.map file etc seems
+>    > to be handled cleanly.
 > 1. Avoid encodings entirely. Mostly what I'm doing now; probably
 >    could find a way to disable encoding of console output. Then the raw
 >    filename would be displayed, which should work ok. git-annex does
@@ -50,13 +62,3 @@ It looks like the common latin1-to-UTF8 encoding. Functionality other than otupu
 >    One other possible
 >    issue would be that this could cause problems if git-annex were
 >    translated.
-> 
-> BTW, for more fun, try unsetting LANG, and then you can see
-> stuff like this:
-
-	joey@gnu:~/tmp/aa>git annex add ./Üa
-	add add add add git-annex: <stdout>: commitAndReleaseBuffer: invalid
-	argument (Invalid or incomplete multibyte or wide character)
-
-> (Add -q to work around this; once it doesn't need to print the filename,
-> it can act on it ok!)
--- a/doc/bugs/unhappy_without_UTF8_locale.mdwn
+++ b/doc/bugs/unhappy_without_UTF8_locale.mdwn
@@ -0,0 +1,33 @@
+Try unsetting LANG and passing git-annex unicode filenames.
+
+	joey@gnu:~/tmp/aa>git annex add ./Üa
+	add add add add git-annex: <stdout>: commitAndReleaseBuffer: invalid
+	argument (Invalid or incomplete multibyte or wide character)
+
+The same problem can be seen with a simple haskell program:
+
+	import System.Environment
+	import Codec.Binary.UTF8.String
+	main = do
+	        args <- getArgs
+	        putStrLn $ decodeString $ args !! 0
+
+	joey@gnu:~/src/git-annex>LANG= runghc ~/foo.hs Ü
+	foo.hs: <stdout>: hPutChar: invalid argument (Invalid or incomplete multibyte or wide character)
+
+(The call to `decodeString` is necessary to make the input
+unicode string be displayed properly in a utf8 locale, but
+does not contribute to this problem.)
+
+I guess that haskell is setting the IO encoding to latin1, which
+is [documented](http://haskell.org/ghc/docs/latest/html/libraries/base/System-IO.html#v:latin1)
+to error out on characters > 255. 
+
+So this program doesn't have the problem -- but may output garbage
+on non-utf-8 capable terminals:
+
+	import System.IO
+	main = do
+ 		hSetEncoding stdout utf8
+	        args <- getArgs
+	        putStrLn $ decodeString $ args !! 0