git-annex/Git/Filename.hs

{- Some git commands output encoded filenames, in a rather annoyingly complex
 - C-style encoding.
 -
 - Copyright 2010, 2011 Joey Hess <id@joeyh.name>
 -
 - Licensed under the GNU AGPL version 3 or higher.
 -}

module Git.Filename where

import Common
import Utility.Format (decode_c, encode_c)

import Data.Char
import Data.Word
import qualified Data.ByteString as S

-- Encoded filenames are wrapped in double quotes; any input that is not
-- quoted on both ends is returned unchanged.
decode :: S.ByteString -> RawFilePath
decode b = case S.uncons b of
	Nothing -> b
	Just (h, t)
		| h /= q -> b
		| otherwise -> case S.unsnoc t of
			Nothing -> b
			Just (i, l)
				| l /= q -> b
				| otherwise ->
					encodeBS $ decode_c $ decodeBS i
  where
	q :: Word8
	q = fromIntegral (ord '"')

{- Should not need to use this, except for testing decode. -}
encode :: RawFilePath -> S.ByteString
encode s = encodeBS $ "\"" ++ encode_c (decodeBS s) ++ "\""

prop_encode_decode_roundtrip :: FilePath -> Bool
prop_encode_decode_roundtrip s = s' ==
	fromRawFilePath (decode (encode (toRawFilePath s')))
  where
	s' = nonul (nohigh s)
	-- Encoding and then decoding roundtrips only when
	-- the string does not contain high unicode, because, e.g.,
	-- both "\12345" and "\227\128\185" are encoded to
	-- "\343\200\271".
	--
	-- This property papers over the problem by only
	-- testing ASCII.
	nohigh = filter isAscii
	-- A String can contain a NUL, but toRawFilePath
	-- truncates at the NUL, which is generally fine
	-- because unix filenames cannot contain NUL.
	-- So the encoding only roundtrips when there is no NUL.
	nonul = filter (/= '\NUL')