git-annex/Utility/Base64.hs

{- Simple Base64 encoding
 -
 - Copyright 2011-2019 Joey Hess <id@joeyh.name>
 -
 - License: BSD-2-clause
 -}

{-# LANGUAGE PackageImports #-}

module Utility.Base64 where

import Utility.FileSystemEncoding

import qualified "sandi" Codec.Binary.Base64 as B64
import Data.Maybe
import qualified Data.ByteString as B
import Data.ByteString.UTF8 (fromString, toString)
import Data.Char

-- | This uses the FileSystemEncoding, so it can be used on Strings
-- that represent filepaths containing arbitrarily encoded characters.
toB64 :: String -> String
toB64 = toString . B64.encode . encodeBS

-- | Strict ByteString variant of toB64; no encoding conversion is applied.
toB64' :: B.ByteString -> B.ByteString
toB64' = B64.encode

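-- | Decodes base64, returning Nothing when the input is not valid base64.
-- The decoded bytes are converted to a String using the FileSystemEncoding.
fromB64Maybe :: String -> Maybe String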
fromB64Maybe s = either (const Nothing) (Just . decodeBS)
	(B64.decode $ fromString s)

-- | Strict ByteString variant of fromB64Maybe.
fromB64Maybe' :: B.ByteString -> Maybe B.ByteString
fromB64Maybe' = either (const Nothing) Just . B64.decode

-- | Like fromB64Maybe, but throws an error when the input is not valid
-- base64.
fromB64 :: String -> String
fromB64 = fromMaybe bad . fromB64Maybe
  where
	bad = error "bad base64 encoded data"

-- | Like fromB64Maybe', but throws an error when the input is not valid
-- base64.
fromB64' :: B.ByteString -> B.ByteString
fromB64' = fromMaybe bad . fromB64Maybe'
  where
	bad = error "bad base64 encoded data"

-- Only ASCII strings are tested, because an arbitrary string may contain
-- characters not encoded using the FileSystemEncoding, which would thus
-- not roundtrip, as decodeBS always generates an output encoded that way.
prop_b64_roundtrips :: String -> Bool
prop_b64_roundtrips s
	| all isAscii s = s == decodeBS (fromB64' (toB64' (encodeBS s)))
	| otherwise = True