metadata: Fix encoding problem that led to mojibake when storing metadata strings that contained both unicode characters and a space (or '!') character.
The fix is to stop using w82s, which does not properly reconstitute unicode strings. Instead, use utf8 bytestring to get the [Word8] to base64. This passes unicode through perfectly, including any invalid filesystem encoded characters.

Note that toB64 / fromB64 are also used for creds and cipher embedding. It would be unfortunate if this change broke those uses.

For cipher embedding, note that ciphers can contain arbitrary bytes (should really be using ByteString.Char8 there). Testing indicated it's not safe to use the new fromB64 there; I think that characters were incorrectly combined.

For credpair embedding, the username or password could contain unicode. Before, that unicode would fail to round-trip through the b64. So, I guess this is not going to break any embedded creds that worked before.

This bug may have affected some creds before, and if so, this change will not fix old ones, but should fix new ones at least.
parent b9ff4a6001
commit 9b93278e8a
8 changed files with 88 additions and 13 deletions
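The idea in the commit message, going through UTF-8 bytes rather than one byte per character, can be sketched roughly as follows. This is only an illustration, not the code from the commit: the names `toUtf8Bytes`, `fromUtf8Bytes` and `roundTrips` are made up here, the `text` and `bytestring` packages stand in for the utf8 bytestring conversion the message mentions, and the base64 step itself is left out. Unlike the fix described above, this sketch makes no attempt to preserve invalid filesystem encoded characters.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (lenientDecode)
import Data.Word (Word8)

-- Hypothetical helper: encode a String as UTF-8 and expose the raw bytes.
-- These bytes are what would be handed to the base64 encoder.
toUtf8Bytes :: String -> [Word8]
toUtf8Bytes = B.unpack . TE.encodeUtf8 . T.pack

-- Hypothetical helper: decode UTF-8 bytes back into a String.
-- lenientDecode replaces invalid byte sequences with U+FFFD instead of
-- throwing, so arbitrary binary data does NOT survive this direction.
fromUtf8Bytes :: [Word8] -> String
fromUtf8Bytes = T.unpack . TE.decodeUtf8With lenientDecode . B.pack

-- The property the fix is after: valid unicode survives the byte round trip.
roundTrips :: String -> Bool
roundTrips s = fromUtf8Bytes (toUtf8Bytes s) == s

-- toUtf8Bytes "\8230"  gives [226,128,166], the three UTF-8 bytes of U+2026
```

The lenient decoding also hints at why decoding arbitrary bytes (such as a cipher) as UTF-8 is a poor fit: bytes that are not valid UTF-8 cannot come back unchanged.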
@@ -13,3 +13,5 @@ Unicode characters in metadata are pruned/converted/lost:
### What version of git-annex are you using? On what operating system?

5.20141125 Debian

> [[fixed|done]]; test pass. --[[Joey]]
@@ -0,0 +1,32 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2015-03-04T14:31:21Z"
content="""
What I'm seeing is the unicode arrow is replaced with 0092 and the ellipsis
with &. It's losing the other byte.

The problem seems to be in the base64 encoding that's done, when the metadata
value contains spaces or a few other problem characters. These same
unicode characters roundtrip through without a problem when not embedded
in a string with spaces.

<pre>
*Utility.Base64> let s = "…"
*Utility.Base64> (s, fromB64 $ toB64 s)
("\8230","&")
</pre>

git-annex also uses base64 for encoding some creds
(and for tagged pushes over XMPP, but only the JID is encoded).

The real culprit is the use of `w82s`, which doesn't handle multi-byte
characters. I can easily fix this by using `encodeW8` instead.
Audited git-annex for other problem w82s uses and don't see any, so will
only need to fix this once.

Added a quickcheck test for fromB64 . toB64 roundtripping.

Unfortunately, the entered unicode characters didn't get saved right,
so git-annex can do nothing to fix data that was already entered.
"""]]
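The symptoms in the comment above (the ellipsis coming back as `&`, the arrow as 0092) are what you would expect if each character were reduced to a single byte somewhere in the round trip. The sketch below reproduces that arithmetic using only `base`. The function names are hypothetical stand-ins, not git-annex's actual helpers, and the arrow is assumed to be U+2192, which the comment does not state.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical lossy conversion: fromIntegral wraps modulo 256, so only
-- the low 8 bits of the code point are kept.
truncateToByte :: Char -> Word8
truncateToByte = fromIntegral . ord

-- Reinterpret each byte as its own code point (one byte, one Char).
bytesToString :: [Word8] -> String
bytesToString = map (chr . fromIntegral)

-- Round trip through the lossy conversion.
lossyRoundTrip :: String -> String
lossyRoundTrip = bytesToString . map truncateToByte

-- U+2026 (ellipsis) is 0x2026; its low byte 0x26 is '&':
--   lossyRoundTrip "\8230" == "&"
-- U+2192 (assumed arrow) is 0x2192; its low byte is 0x92:
--   lossyRoundTrip "\8594" == "\146"
```

Characters below U+0100 pass through unchanged, so plain ASCII strings would not have shown the problem.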