Better sanitization of problem characters when generating URL and WORM keys.

FAT has a lot of characters it does not allow in filenames, like ? and *.
It's probably the worst offender, but other filesystems also have
limitations.

In 2011, I made keyFile escape : to handle FAT, but missed the other
characters. It also turns out that when I did that, I was living
dangerously; any existing keys that contained a : had their object
location change. Oops.

So, adding new characters to escape to keyFile is out. Well, it would be
possible to make keyFile behave differently on a per-filesystem basis, but
this would be a real nightmare to get right. Consider that an rsync special
remote uses keyFile to determine the filenames to use, and we don't know
the underlying filesystem on the rsync server.

Instead, I have gone for a solution that is backwards compatible and
simple. Its only downside is that already generated URL and WORM keys
might not be able to be stored on FAT or some other filesystem that
dislikes a character used in the key. (In this case, the user can just
migrate the problem keys to a checksumming backend. If this became a big
problem, fsck could be made to detect these and suggest a migration.)

Going forward, new keys that are created will escape all characters that
are likely to cause problems. And if some filesystem comes along that's
even worse than FAT (seems unlikely, but here it is 2013, and people are
still using FAT!), additional characters can be added to the set that are
escaped without difficulty.

(Also, made WORM limit the part of the filename that is embedded in the key,
to deal with filesystem filename length limits. This could have already
been a problem, but is more likely now, since the escaping of the filename
can make it longer.)

This commit was sponsored by Ian Downes
2013-10-05 19:01:49 +00:00
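
The escaping idea described in the commit message can be illustrated with a minimal sketch. This is not the escaping that git-annex actually uses: the function name `escapeProblemChars`, the predicate `isSafe`, the choice of `,` as the escape character, and the set of characters treated as safe are all assumptions made for illustration. It only shows how a single "safe set" keeps the list of escaped characters easy to extend.

```haskell
import Data.Char (isAlphaNum, isAscii, ord)

-- Hypothetical safe set: ASCII letters and digits, plus a few
-- punctuation characters assumed to be accepted by FAT and friends.
isSafe :: Char -> Bool
isSafe c = (isAscii c && isAlphaNum c) || c `elem` "._-"

-- Hypothetical escape scheme: safe characters pass through unchanged,
-- the escape character itself is doubled, and anything else becomes
-- ',' followed by its decimal code point.
escapeProblemChars :: String -> String
escapeProblemChars = concatMap escape
  where
    escape c
        | isSafe c = [c]
        | c == ',' = ",,"
        | otherwise = ',' : show (ord c)
```

Under these assumptions, `escapeProblemChars "a?b*.txt"` yields `"a,63b,42.txt"`, so neither `?` nor `*` reaches the filesystem. Supporting a pickier filesystem only means shrinking `isSafe`; keys generated earlier are left alone, which matches the backwards-compatibility concern in the message above.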

{- git-annex backend utilities
 -
 - Copyright 2012-2024 Joey Hess <id@joeyh.name>
 -
 - Licensed under the GNU AGPL version 3 or higher.
 -}

{-# LANGUAGE OverloadedStrings #-}

module Backend.Utilities where

import Annex.Common
import qualified Annex
import Utility.Hash
import Types.Key
import Types.KeySource

import qualified Data.ByteString as S
import qualified Data.ByteString.Short as S (ShortByteString, toShort)
import qualified Data.ByteString.Lazy as L
import qualified System.FilePath.ByteString as P
import Data.Char
import Data.Word

{- Generates a keyName from an input string. Takes care of sanitizing it.
 - If it's not too long, the full string is used as the keyName.
 - Otherwise, it's truncated, and its md5 is appended to ensure a unique
 - key. -}
genKeyName :: String -> S.ShortByteString
genKeyName s
    -- Avoid making keys longer than the length of a SHA256 checksum.
    | bytelen > sha256len = S.toShort $ encodeBS $
        truncateFilePath (sha256len - md5len - 1) s' ++ "-" ++
            show (md5 bl)
    | otherwise = S.toShort $ encodeBS s'
  where
    s' = preSanitizeKeyName s
    bl = encodeBL s
    bytelen = fromIntegral $ L.length bl

    sha256len = 64
    md5len = 32

{- Converts a key to a version that includes an extension from the
 - file that the key was generated from. -}
addE :: KeySource -> (KeyVariety -> KeyVariety) -> Key -> Annex Key
addE source sethasext k = do
    c <- Annex.getGitConfig
    let ext = selectExtension
            (annexMaxExtensionLength c)
            (annexMaxExtensions c)
            (keyFilename source)
    return $ alterKey k $ \d -> d
        { keyName = keyName d <> S.toShort ext
        , keyVariety = sethasext (keyVariety d)
        }

selectExtension :: Maybe Int -> Maybe Int -> RawFilePath -> S.ByteString
selectExtension maxlen maxextensions f
    | null es = ""
    | otherwise = S.intercalate "." ("":es)
  where
    es = filter (not . S.null) $ reverse $
        take (fromMaybe maxExtensions maxextensions) $
        filter (S.all validInExtension) $
        takeWhile shortenough $
        reverse $ S.split (fromIntegral (ord '.')) (P.takeExtensions f')
    shortenough e = S.length e <= fromMaybe maxExtensionLen maxlen
    -- Avoid treating a file ".foo" as having its whole name as an
    -- extension.
    f' = S.dropWhile (== fromIntegral (ord '.')) (P.takeFileName f)

validInExtension :: Word8 -> Bool
validInExtension c
    | isAlphaNum (chr (fromIntegral c)) = True
    | fromIntegral c == ord '.' = True
    | c <= 127 = False -- other ascii: spaces, punctuation, control chars
    | otherwise = True -- utf8 is allowed, also other encodings

maxExtensionLen :: Int
maxExtensionLen = 4 -- long enough for "jpeg"

maxExtensions :: Int
maxExtensions = 2 -- include both extensions of "tar.gz"
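
The extension selection above can be easier to follow in a simplified form. The following is a rough sketch rather than the module's code: it works on plain `String` values instead of `ByteString`, takes the two limits as plain `Int` arguments instead of reading them from the git config, and assumes the input is a bare file name with no directory part. The names `selectExtensionSketch` and `splitOnDot` are made up for this example.

```haskell
import Data.Char (isAlphaNum, isAscii)

-- Split a string on '.' characters, keeping empty components.
splitOnDot :: String -> [String]
splitOnDot s = case break (== '.') s of
    (a, []) -> [a]
    (a, _ : rest) -> a : splitOnDot rest

-- Pick at most maxexts trailing extensions, each at most maxlen
-- characters long and free of ASCII punctuation or spaces.
selectExtensionSketch :: Int -> Int -> FilePath -> String
selectExtensionSketch maxlen maxexts name
    | null es = ""
    | otherwise = concatMap ('.' :) es
  where
    -- Leading dots are dropped so ".foo" is not treated as an extension.
    base = dropWhile (== '.') name
    -- Everything after the first dot is a candidate extension.
    candidates = drop 1 (splitOnDot base)
    es = filter (not . null) $ reverse $
        take maxexts $
        filter (all valid) $
        takeWhile ((<= maxlen) . length) $
        reverse candidates
    -- Non-ASCII characters are allowed; ASCII must be alphanumeric.
    valid c = not (isAscii c) || isAlphaNum c
```

With limits of 4 and 2, `selectExtensionSketch 4 2 "foo.tar.gz"` gives `".tar.gz"`, `"archive.backup.gz"` gives `".gz"` because `backup` is too long and stops the scan from the right, and `".bashrc"` gives `""`.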