details..
parent
022e0c7751
commit
6c58a58393
1 changed files with 42 additions and 0 deletions
@@ -18,3 +18,45 @@ There are problems with displaying filenames in UTF8 encoding, as shown here:
$
It looks like the common latin1-to-UTF8 encoding. Functionality other than output seems not to be affected.

> Yes, I believe that git-annex is reading filename data from git
> as a stream of char8s, and not decoding unicode in it into logical
> characters.
> Haskell then I guess, tries to unicode encode it when it's output to
> the console.
> This only seems to matter WRT its output to the console; the data
> does not get mangled internally and so it accesses the right files
> under the hood.
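
The effect described above can be reproduced outside git-annex. The following is a minimal sketch in Python, used here purely for illustration: git-annex itself is written in Haskell, and the function names below are hypothetical and do not come from its source. It assumes the filename is stored on disk as UTF-8, reads it as one character per byte, and then encodes the result as UTF-8 for the console.

```python
def read_as_char8(raw: bytes) -> str:
    """Treat every byte as one character, with no UTF-8 decoding.

    latin-1 maps bytes 0x00-0xFF to code points U+0000-U+00FF one to one,
    so it stands in for a "stream of char8s" here (an assumption of this sketch).
    """
    return raw.decode("latin-1")


def write_as_utf8(text: str) -> bytes:
    """Encode characters for a UTF-8 console."""
    return text.encode("utf-8")


name = "\u00dc" + "a"                    # the filename "Üa"
on_disk = name.encode("utf-8")           # two bytes for Ü, one for a
internal = read_as_char8(on_disk)        # three characters instead of two
shown = write_as_utf8(internal)          # each high byte is encoded a second time
round_trip = internal.encode("latin-1")  # the original bytes are still recoverable

print(on_disk)   # bytes as stored in the repository
print(shown)     # bytes that reach the terminal
```

The last step is why only the output is affected: mapping the internal characters back to bytes yields exactly what is on disk, so file access keeps working while the console shows the doubly encoded form.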
>
> I am too new to Haskell to really have a handle on how to handle
> unicode and other encoding issues with it. In general, there are three
> valid approaches: --[[Joey]]
>
> 1. Convert all input data to unicode and be unicode clean end-to-end
> internally. Problematic here since filenames may not necessarily be
> encoded in utf-8 (an archive could have historical filenames using
> varying encodings), and you don't want which files are accessed to
> depend on locale settings.
> 1. Keep input and internal data un-decoded, but decode it when
> outputting a filename (assuming the filename is encoded using the
> user's configured encoding), and allow Haskell's output encoding to then
> encode it according to the user's locale configuration.
> 1. Avoid encodings entirely. Mostly what I'm doing now; probably
> could find a way to disable encoding of console output. Then the raw
> filename would be displayed, which should work ok. git-annex does
> not really need to pull apart filenames; they are almost entirely
> opaque blobs. I guess that the `--exclude` option is the exception
> to that, but it is currently not unicode safe anyway.
> One other possible
> issue would be that this could cause problems if git-annex were
> translated.
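
The three approaches listed above can be compared side by side. This is a hedged sketch in Python, chosen only for illustration; the function names are hypothetical and nothing here is git-annex's actual code. It assumes a UTF-8 locale and a historical filename stored in latin-1, and the use of `errors="replace"` in the second function is a choice made for this sketch, not something the text prescribes.

```python
import io


def decode_on_input(raw: bytes, encoding: str = "utf-8") -> str:
    """Approach 1: decode at the boundary and stay unicode clean internally.

    Raises UnicodeDecodeError when a filename is not valid in `encoding`.
    """
    return raw.decode(encoding)


def decode_for_display(raw: bytes, encoding: str = "utf-8") -> str:
    """Approach 2: keep bytes internally, decode only when printing.

    Undecodable bytes become U+FFFD so that display never fails
    (an assumption of this sketch).
    """
    return raw.decode(encoding, errors="replace")


def write_raw(raw: bytes, out) -> None:
    """Approach 3: avoid encodings and write the bytes to a binary stream."""
    out.write(raw + b"\n")


legacy_name = b"\xdc" + b"a"  # "Üa" in latin-1, which is not valid UTF-8

try:
    decode_on_input(legacy_name)
    approach_1_ok = True
except UnicodeDecodeError:
    approach_1_ok = False     # the file cannot even be represented internally

shown = decode_for_display(legacy_name)  # displayable, with a replacement character

sink = io.BytesIO()                      # stands in for a binary stdout
write_raw(legacy_name, sink)             # bytes pass through unchanged
```

In a real Python program the binary stream for the third approach would be `sys.stdout.buffer`. The sketch shows the trade-off: the first approach rejects the legacy name outright, the second displays it imperfectly but keeps the bytes intact, and the third leaves interpretation to the terminal.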
>
> BTW, for more fun, try unsetting LANG, and then you can see
> stuff like this:

    joey@gnu:~/tmp/aa>git annex add ./Üa
    add add add add git-annex: <stdout>: commitAndReleaseBuffer: invalid
    argument (Invalid or incomplete multibyte or wide character)

> (Add -q to work around this; once it doesn't need to print the filename,
> it can act on it ok!)
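
The failure with `LANG` unset, and the `-q` workaround, can be imitated with an analogy in Python. This is a sketch under stated assumptions and not a reproduction of git-annex's behaviour: it assumes that without `LANG` the console encoding falls back to ASCII, and `add_file`, `ascii_console`, and the `quiet` parameter are hypothetical names invented for the example.

```python
import io


def ascii_console() -> io.TextIOWrapper:
    """A text stream that can only encode ASCII, standing in for stdout
    under an unset LANG (an assumption of this sketch)."""
    return io.TextIOWrapper(io.BytesIO(), encoding="ascii")


def add_file(raw_name: bytes, console, quiet: bool = False) -> bool:
    """Pretend to add a file, printing its name unless `quiet` is set."""
    if not quiet:
        # One character per byte, as in the char8 reading described earlier;
        # writing it forces the stream to encode non-ASCII characters.
        console.write("add " + raw_name.decode("latin-1") + "\n")
    # The real work would operate on raw_name as bytes and needs no encoding.
    return True


name = b"\xc3\x9ca"  # UTF-8 bytes of the filename "Üa"

try:
    add_file(name, ascii_console())
    verbose_ok = True
except UnicodeEncodeError:
    verbose_ok = False  # printing the name is what fails, not the operation

quiet_ok = add_file(name, ascii_console(), quiet=True)
```

Only the print step touches the console encoding, which matches the observation that the command can act on the file once it no longer has to display the name.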