From 828df56453d9b0a1483d5c85e6ca739b158883d3 Mon Sep 17 00:00:00 2001
From: Joey Hess
Date: Thu, 2 Feb 2012 15:41:22 -0400
Subject: [PATCH] update; newghc-edges branch

---
 doc/bugs/problems_with_utf8_names.mdwn | 73 +++++++++++++++++---------
 1 file changed, 49 insertions(+), 24 deletions(-)

diff --git a/doc/bugs/problems_with_utf8_names.mdwn b/doc/bugs/problems_with_utf8_names.mdwn
index c9ca1e3b07..ac110a6aed 100644
--- a/doc/bugs/problems_with_utf8_names.mdwn
+++ b/doc/bugs/problems_with_utf8_names.mdwn
@@ -1,35 +1,60 @@
 This bug is reopened to track some new UTF-8 filename issues caused by GHC
 7.4. In this version of GHC, git-annex's hack to support filenames in any
 encoding no longer works. Even unicode filenames fail to work when
-git-annex is built with 7.4. --[[Joey]]
-
-> What's going on exactly? The new ghc, when presented with
-> a String of raw bytes like "fo\194\161", and asked to do
-> something like `getSymbolicLinkStatus`, encodes it
-> as unicode, yielding "fo\303\202\302\241". Which is
-> not the same as the original filename.
-
-The new ghc requires a new data type, `RawFilePath` be used if you
-don't want to impose utf-8 filenames on your users. I have a `newghc` branch
-in git where I am trying to convert it to use `RawFilePath`. However, since
-there is no way to cast a `FilePath` to a `RawFilePath` or back (because
-the encoding of `RawFilePath` is not specified), this means changing
-essentially all of git-annex. Even the filenames used for keys in
-`.git/annex/objects` need to use the new data type. --[[Joey]]
-
-> Actually it may not be that bad. A `RawFilePath` contains only bytes,
-> so it can be cast to a string, containing encoded characters. That
-> string can then be 1) output in binary mode or 2) manipulated
-> in ways that do not add characters larger than 255, and cast back to
-> a `RawFilePath`. While not type-safe, such casts should at least
-> help during bootstrapping, and might allow for a quick fix that only
-> changes to `RawFilePath` at the edges.
+git-annex is built with 7.4. --[[Joey]]
 
 **As a stopgap workaround**, I have made a branch `unicode-only`. This
 makes git-annex work with unicode filenames with ghc 7.4, but *only*
 unicode filenames. If you have filenames with some other encoding, you're
 out in the cold, and it will probably just crash with a error about wrong
-encoding. --[[Joey]]
+encoding.
+
+## analysis
+
+What's going on exactly? The new ghc, when presented with
+a String of raw bytes like "fo\194\161", and asked to do
+something like `getSymbolicLinkStatus`, encodes it
+as unicode, yielding "fo\303\202\302\241". Which is
+not the same as the original filename, assuming it was "fo¡".
+
+The new ghc requires a new data type, `RawFilePath` be used if you
+don't want to impose utf-8 filenames on your users.
+
+The available `RawFilePath` support is quite low-level, so all the nice
+readFile and writeFile code, etc has to be reimplemented. So do any utility
+libraries that do things with FilePaths, if you need them to use
+RawFilePaths. Until the haskell ecosystem adapts to `RawFilePath`
+(if it does), using it broadly, as git-annex needs to, will be difficult.
+
+## newghc branch
+
+I have a `newghc` branch in git where I am trying to convert it to use
+`RawFilePath`. However, since there is no way to cast a `FilePath` to a
+`RawFilePath` or back (because the encoding of `RawFilePath` is not
+specified), this means changing essentially all of git-annex. Even the
+filenames used for keys in `.git/annex/objects` need to use the new data
+type. I didn't get very far on this branch.
+
+## newghc-edges branch
+
+I have a `newghc-edges` branch in git, trying a different approach.
+
+A `RawFilePath` contains only bytes, so it can actually be cast to a string,
+containing encoded characters. That string can then be 1) output in binary
+mode or 2) manipulated in ways that do not add characters larger than 255,
+and cast back to a `RawFilePath`. While not type-safe, such casts should at
+least help during bootstrapping, and might allow for a quick fix that only
+changes to `RawFilePath` at the edges.
+
+The branch contains an almost complete, although probably also buggy
+conversion using this method. It is missing wrappers for a
+few things like `readFile` and `writeFile` but otherwise seems to
+basically work.
+
+Is this a suitable approach for merging into `master`? It's nasty,
+being not type safe, having to reimplent/copy+modify random bits of
+libraries, etc. The nastiness is contained, though, in a single file,
+of only a few hundred lines of code. --[[Joey]]
 
 ----
 
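
Reviewer note: the byte arithmetic in the analysis section, and the cast the `newghc-edges` section relies on, can be checked with a small sketch. It is not code from either branch. The names `utf8Char`, `rawName`, `reencoded`, `toRaw` and `fromRaw` are made up for illustration. `RawFilePath` is modelled as a plain `ByteString` so the sketch needs only `base` and `bytestring`, and the hand-written encoder only covers code points below 0x800. The patch writes the first name with decimal escapes (`\194\161`) and the second with octal ones (`\303\202\302\241`); the sketch prints decimal bytes throughout.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as B8
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Minimal UTF-8 encoder for this sketch only: handles code points < 0x800.
utf8Char :: Char -> [Word8]
utf8Char c
  | n < 0x80  = [fromIntegral n]
  | n < 0x800 = [ fromIntegral (0xC0 .|. (n `shiftR` 6))
                , fromIntegral (0x80 .|. (n .&. 0x3F)) ]
  | otherwise = error "utf8Char: code point out of range for this sketch"
  where
    n = ord c

-- The UTF-8 bytes of "fo¡", stored one byte per Char (the old hack).
rawName :: String
rawName = "fo\194\161"

-- What happens when each of those Chars is encoded as unicode again.
reencoded :: B.ByteString
reencoded = B.pack (concatMap utf8Char rawName)

-- Byte-preserving casts, assuming no Char above 255 is ever introduced.
-- B8.pack silently truncates larger Chars, which is the non-type-safe part.
toRaw :: String -> B.ByteString
toRaw = B8.pack

fromRaw :: B.ByteString -> String
fromRaw = B8.unpack

main :: IO ()
main = do
  print (B.unpack reencoded)                 -- [102,111,195,130,194,161]
  print (B.unpack (toRaw rawName))           -- [102,111,194,161]
  print (fromRaw (toRaw rawName) == rawName) -- True
```

The first line of output is the mangled name (195,130,194,161 decimal is 303,202,302,241 octal). The second shows that the cast keeps the original bytes, and the third that the round trip is lossless as long as every `Char` stays at or below 255.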