2010-12-22 11:52:16 +00:00
There are problems with displaying filenames in UTF8 encoding, as shown here:

    $ echo $LANG
    en_GB.UTF-8
    $ git init
    $ git annex init test
    [...]
    $ touch "Umlaut Ü.txt"
    $ git annex add Uml*
    add Umlaut Ã.txt ok
    (Recording state in git...)
    $ find -name U\* | hexdump -C
    00000000 2e 2f 55 6d 6c 61 75 74 20 c3 9c 2e 74 78 74 0a |./Umlaut ...txt.|
    00000010
    $ git annex find | hexdump -C
    00000000 55 6d 6c 61 75 74 20 c3 83 c2 9c 2e 74 78 74 0a |Umlaut .....txt.|
    00000010
    $

It looks like the common double encoding, where UTF-8 bytes are read as latin1 and encoded to UTF-8 again. Functionality other than output does not seem to be affected.
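
The two byte sequences in the hexdumps can be reproduced with a small sketch. It is only an illustration of the double encoding, not code from git-annex: `utf8` is a hand-written helper that handles code points below U+0800 only, and `hex` is a hypothetical formatting helper.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Numeric (showHex)

-- Minimal UTF-8 encoder; a sketch that only covers code points below U+0800.
utf8 :: Char -> [Int]
utf8 c
  | n < 0x80  = [n]
  | n < 0x800 = [0xC0 .|. (n `shiftR` 6), 0x80 .|. (n .&. 0x3F)]
  | otherwise = error "sketch only handles code points below U+0800"
  where
    n = ord c

-- Render a list of byte values as space-separated hex.
hex :: [Int] -> String
hex = unwords . map (\b -> showHex b "")

main :: IO ()
main = do
  let once  = utf8 '\xDC'                  -- U+00DC (Ü) encoded once
      twice = concatMap (utf8 . chr) once  -- each byte read as a char, then encoded again
  putStrLn (hex once)   -- c3 9c
  putStrLn (hex twice)  -- c3 83 c2 9c
```

The first line matches the bytes `find` reports for the file on disk, and the second matches what `git annex find` printed.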
2010-12-28 19:28:23 +00:00
> Yes, I believe that git-annex is reading filename data from git
> as a stream of char8s, and not decoding unicode in it into logical
> characters.
> Haskell then, I guess, tries to unicode-encode it when it's output to
> the console.
> This only seems to matter WRT its output to the console; the data
> does not get mangled internally and so it accesses the right files
> under the hood.
>
> I am too new to haskell to really have a handle on how to handle
> unicode and other encoding issues with it. In general, there are three
> valid approaches: --[[Joey]]
>
> 1. Convert all input data to unicode and be unicode clean end-to-end
> internally. Problematic here since filenames may not necessarily be
> encoded in utf-8 (an archive could have historical filenames using
> varying encodings), and you don't want which files are accessed to
> depend on locale settings.
2011-02-10 18:21:44 +00:00
> > I tried to do this by making parts of GitRepo call
> > Codec.Binary.UTF8.String.decodeString when reading filenames from
> > git. This seemed to break attempts to operate on the files;
> > weirdly encoded strings were seen in syscalls in strace.
2010-12-28 19:28:23 +00:00
> 1. Keep input and internal data un-decoded, but decode it when
> outputting a filename (assuming the filename is encoded using the
> user's configured encoding), and allow haskell's output encoding to then
> encode it according to the user's locale configuration.
2011-02-10 18:21:44 +00:00
> > This is now [[implemented|done]]. I'm not very happy that I have to watch
> > out for any place that a filename is output and call `showFile`
> > on it, but there are really not too many such places in git-annex.
> >
> > Note that this apparently only affects filenames.
> > (Names of files in the annex, and also some places where names
> > of keys are displayed.) Utf-8 in the uuid.map file etc seems
> > to be handled cleanly.
2010-12-28 19:28:23 +00:00
> 1. Avoid encodings entirely. Mostly what I'm doing now; probably
> could find a way to disable encoding of console output. Then the raw
> filename would be displayed, which should work ok. git-annex does
> not really need to pull apart filenames; they are almost entirely
> opaque blobs. I guess that the `--exclude` option is the exception
2011-02-10 18:45:35 +00:00
> to that, but it is currently not unicode safe anyway. (Update: tried
> `--exclude` again, seems it is unicode clean.)
2010-12-28 19:28:23 +00:00
> One other possible
> issue would be that this could cause problems if git-annex were
> translated.
2011-02-10 18:58:09 +00:00

----

Simpler test case:

<pre>
import Codec.Binary.UTF8.String
import System.Environment

main = do
    args <- getArgs
    -- decode the argument from UTF-8 before using it as a filename
    let file = decodeString $ head args
    putStrLn $ "file is: " ++ file
    putStr =<< readFile file
</pre>

If I pass this a filename like 'ü', it fails. Notice
the bad encoding of the filename in the error message:

<pre>
$ echo hi > ü; runghc foo.hs ü
file is: ü
foo.hs: <20>: openFile: does not exist (No such file or directory)
</pre>

On the other hand, if I remove the `decodeString`, it prints the filename
wrong, while accessing it right:

<pre>
$ runghc foo.hs ü
file is: üa
hi
</pre>

The only way that seems to consistently work is to delay decoding the
filename to places where it's output. But then it's easy to miss some.
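
One way to make a missed spot harder to write is to wrap the undecoded name in its own type, so it cannot be handed to `putStrLn` by accident. The following is only a rough sketch of that idea, not how git-annex's `showFile` is written. `RawName`, `displayName` and `rawPath` are hypothetical names, and the sketch assumes, as in the test case, that each Char of the filename holds one raw byte. The hand-written decoder handles 1 to 3 byte sequences, does not reject overlong forms, and passes anything else through untouched.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr, ord)

-- Hypothetical wrapper: a filename kept exactly as read, one Char per raw byte.
newtype RawName = RawName String

-- The undecoded string, for handing to file operations.
rawPath :: RawName -> FilePath
rawPath (RawName s) = s

-- Decode UTF-8 sequences for display only. Bytes that do not form a
-- complete 2- or 3-byte sequence are passed through unchanged.
displayName :: RawName -> String
displayName (RawName s) = go s
  where
    go [] = []
    go (c:cs)
      | b < 0x80           = c : go cs
      | b .&. 0xE0 == 0xC0 = multi 1 (b .&. 0x1F)
      | b .&. 0xF0 == 0xE0 = multi 2 (b .&. 0x0F)
      | otherwise          = c : go cs
      where
        b = ord c
        -- Consume n continuation bytes if they are all present and valid.
        multi n lead =
          let (conts, rest) = splitAt n cs
          in if length conts == n && all isCont conts
               then chr (foldl step lead conts) : go rest
               else c : go cs
    isCont x = ord x .&. 0xC0 == 0x80
    step acc x = (acc `shiftL` 6) .|. (ord x .&. 0x3F)

main :: IO ()
main = do
  let name = RawName "Umlaut \xC3\x9C.txt"
  print (displayName name == "Umlaut \xDC.txt")   -- True: decoded for output
  print (rawPath name == "Umlaut \xC3\x9C.txt")   -- True: bytes untouched for file access
```

With a wrapper like this, the type checker points out every place a filename is output, instead of having to hunt for them by hand.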