convert encode_c to ByteString

This turns out to be possible after all, because the old one decomposed
a unicode Char to multiple Word8s and encoded those.

It should be faster in some places, particularly in
Git.Filename.encodeAlways.

The old version encoded all unicode by default, as well as ascii control
characters and also '"'. The new one only encodes ascii control
characters by default.

That old behavior was visible in Utility.Format.format, which did escape
'"' when used in e.g. git-annex find --format='${escaped_file}\n'
So I made sure to keep that working the same. The man page only says it
will escape "unusual" characters, though, so it might be possible to
change that.

Git.Filename.encodeAlways also needs to escape '"'; that was the
original reason it was escaped.

I judge that Types.Transferrer is ok to not escape '"', because the
escaped value is sent in a line-based protocol, which is decoded at the
other end by decode_c. So old git-annex and new will be fine whether
that is escaped or not; the result will be the same.

Note that when asked to escape a double quote, it is escaped to \"
rather than to \042. That's the same behavior as git has. It's perhaps
somehow more of a special case than it needs to be.

Sponsored-by: k0ld on Patreon
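
The escaping rules described above can be illustrated with a minimal sketch. This is not the code from Utility.Format: encodeSketch, encodeAlwaysSketch and highByte are hypothetical names, all control characters are written as three-digit octal escapes (an illustrative simplification), escaping of the backslash itself is an assumption, and highByte assumes that isUtf8Byte means "any byte >= 0x80".

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString as S
import qualified Data.ByteString.Char8 as S8
import Data.Char (ord)
import Data.Word (Word8)
import Text.Printf (printf)

-- Hypothetical stand-in for encode_c, working directly on bytes.
-- ASCII control characters are always escaped; the predicate selects
-- any additional bytes the caller wants escaped.
encodeSketch :: (Word8 -> Bool) -> S.ByteString -> S.ByteString
encodeSketch extra = S.concatMap escape
  where
    escape w
      -- A requested double quote becomes \" rather than \042.
      | w == dquote && extra w = "\\\""
      -- Assumption: the backslash must be escaped to stay decodable.
      | w == backslash = "\\\\"
      | w < 0x20 || w == 0x7F || extra w = octal w
      | otherwise = S.singleton w
    octal w = S8.pack (printf "\\%03o" (fromIntegral w :: Int))
    dquote = fromIntegral (ord '"')
    backslash = fromIntegral (ord '\\')

-- Assumed meaning of isUtf8Byte: a byte outside the ASCII range.
highByte :: Word8 -> Bool
highByte = (>= 0x80)

-- Mirrors the shape of encodeAlways: always quote, and escape both
-- high bytes and the double quote.
encodeAlwaysSketch :: S.ByteString -> S.ByteString
encodeAlwaysSketch s = "\"" <> encodeSketch need s <> "\""
  where
    need c = highByte c || c == fromIntegral (ord '"')

main :: IO ()
main = do
  -- Default behavior: the double quote is left alone.
  S8.putStrLn (encodeSketch (const False) "a\"b")
  -- encodeAlways-style behavior: the double quote is escaped.
  S8.putStrLn (encodeAlwaysSketch "a\"b")
```

Because the predicate receives one Word8 at a time, a multi-byte UTF-8 sequence is escaped byte by byte, with no need to decompose a Char first.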
2023-04-07 16:47:26 -04:00
commit d9b6be7782
parent 371d4f8183
3 changed files with 66 additions and 45 deletions
--- a/Git/Filename.hs
+++ b/Git/Filename.hs
@@ -1,15 +1,17 @@
 {- Some git commands output encoded filenames, in a rather annoyingly complex
 - C-style encoding.
 -
- - Copyright 2010, 2011 Joey Hess <id@joeyh.name>
+ - Copyright 2010-2023 Joey Hess <id@joeyh.name>
 -
 - Licensed under the GNU AGPL version 3 or higher.
 -}

+{-# LANGUAGE OverloadedStrings #-}
+
 module Git.Filename where

 import Common
-import Utility.Format (decode_c, encode_c)
+import Utility.Format (decode_c, encode_c, isUtf8Byte)
 import Utility.QuickCheck

 import Data.Char
@@ -31,9 +33,11 @@ decode b = case S.uncons b of
  	q :: Word8
 	q = fromIntegral (ord '"')

-{- Should not need to use this, except for testing decode. -}
-encode :: RawFilePath -> S.ByteString
-encode s = encodeBS $ "\"" ++ encode_c (decodeBS s) ++ "\""
+-- always encodes and double quotes, even in cases that git does not
+encodeAlways :: RawFilePath -> S.ByteString
+encodeAlways s = "\"" <> encode_c needencode s <> "\""
+  where
+	needencode c = isUtf8Byte c || c == fromIntegral (ord '"')

 -- Encoding and then decoding roundtrips only when the string does not
 -- contain high unicode, because eg,  both "\12345" and "\227\128\185"
@@ -43,6 +47,6 @@ encode s = encodeBS $ "\"" ++ encode_c (decodeBS s) ++ "\""
 -- limits what's tested to ascii, so avoids running into it.
 prop_encode_decode_roundtrip :: TestableFilePath -> Bool
 prop_encode_decode_roundtrip ts = 
-	s == fromRawFilePath (decode (encode (toRawFilePath s)))
+	s == fromRawFilePath (decode (encodeAlways (toRawFilePath s)))
  where
 	s = fromTestableFilePath ts