git-annex/Git/Filename.hs

{- Some git commands output encoded filenames, in a rather annoyingly complex
 - C-style encoding.
 -
 - Copyright 2010, 2011 Joey Hess <id@joeyh.name>
 -
 - Licensed under the GNU AGPL version 3 or higher.
 -}

module Git.Filename where

import Common
import Utility.Format (decode_c, encode_c)
import Utility.QuickCheck

import Data.Char
import Data.Word
import qualified Data.ByteString as S

-- encoded filenames will be inside double quotes
decode :: S.ByteString -> RawFilePath
decode b = case S.uncons b of
	Nothing -> b
	Just (h, t)
		| h /= q -> b
		| otherwise -> case S.unsnoc t of
			Nothing -> b
			Just (i, l)
				| l /= q -> b
				| otherwise ->
					encodeBS $ decode_c $ decodeBS i
  where
	q :: Word8
	q = fromIntegral (ord '"')
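
{- Illustrative behaviour of decode. This is a sketch, not taken from the
 - test suite. The literals are written as Haskell string literals and
 - assume OverloadedStrings; the last case also assumes that decode_c
 - expands the two-character escape \t to a tab character.
 -
 -   decode "plain.txt"       returns the input unchanged (not quoted)
 -   decode "\"unterminated"  returns the input unchanged (no closing quote)
 -   decode "\""              returns the input unchanged (lone quote)
 -   decode "\"a\\tb\""       strips the quotes and yields "a\tb"
 -}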

{- Should not need to use this, except for testing decode. -}
encode :: RawFilePath -> S.ByteString
encode s = encodeBS $ "\"" ++ encode_c (decodeBS s) ++ "\""

-- Encoding and then decoding roundtrips only when the string does not
-- contain high unicode, because e.g. both "\12345" and its UTF-8 bytes
-- "\227\128\185" are encoded to "\343\200\271" (the same bytes in octal).
--
-- That is not a real-world problem, and using TestableFilePath
-- limits what's tested to ascii, so avoids running into it.
prop_encode_decode_roundtrip :: TestableFilePath -> Bool
prop_encode_decode_roundtrip ts =
	s == fromRawFilePath (decode (encode (toRawFilePath s)))
  where
	s = fromTestableFilePath ts