Fix mangling of --json output of utf-8 characters when not running in a utf-8 locale
As long as all code imports Utility.Aeson rather than Data.Aeson, and no Strings that may contain utf-8 characters are used for eg, object keys via T.pack, this is guaranteed to fix the problem everywhere that git-annex generates json.

It's kind of annoying to need to wrap ToJSON with a ToJSON', especially since every data type that has a ToJSON instance has to be ported over. However, that only took 50 lines of code, which is worth it to ensure full coverage. I initially tried an alternative approach of a newtype FileEncoded, which had to be used everywhere a String was fed into aeson, and chasing down all the sites would have been far too hard. Did consider creating an intentionally overlapping instance ToJSON String, and letting ghc fail to build anything that passed in a String, but am not sure that wouldn't pollute some library that git-annex depends on that happens to use ToJSON String internally.

This commit was supported by the NSF-funded DataLad project.
parent 6ddd374935
commit 89e1a05a8f
14 changed files with 173 additions and 62 deletions
@@ -2,6 +2,25 @@ json is defined as always utf-8. However, when LANG=C,
 git-annex --json currently outputs "file":"���������"
 instead of "file":"äöü東" for that utf-8 filename. --[[Joey]]
 
-(Note that git-annex can operate on non-utf8 filenames; it's not defined
-what the json contains then, which might or might not be considered a bug
-but this is not about that.)
+This can also affect keys when they contain some non-utf8 from eg the
+extension. And metadata keys and values can contain non-utf8 and also get
+converted to json with similar results.
+
+Note that git-annex can operate on non-utf8 filenames and keys;
+it's not defined what the json contains then, and it currently contains
+similar garbage.
+
+This happens because aeson's instance of ToJSON for Char uses
+Text.singleton, and Text does not handle ghc's filesystem encoding
+for String. Instead it defaults to `\65533` for each byte encoded with the
+filesystem encoding.
+
+So, git-annex will need to convert filenames and keys and anything else
+that might use the filesystem encoding to Text itself in some
+way that does respect the filesystem encoding. Ie, use encodeBS to convert
+it to a ByteString and then Data.Text.Encoding.decodeUtf8.
+
+> [[done]] that. --[[Joey]]
+
+What about git-annex commands that take json as input,
+when run in a non-utf8 locale? Tested that, it is handled ok. --[[Joey]]