fully specify the pointer file format
This format is designed to detect accidental appends, while leaving some room for future expansion.

Detect when an unlocked file whose content is not present has gotten some other content appended to it, and avoid treating it as a pointer file, so that appended content will not be checked into git, but will be annexed like any other file.

Dropped the max size of a pointer file down to 32kb. It was around 80kb, but without any good reason, and certainly there are no valid pointer files anywhere that are larger than 8kb, because what a pointer file with additional data even looks like has only just been specified.

I assume 32kb will be good enough for anyone. ;-) Really though, it needs to be some smallish number, because that much of a file in git gets read into memory when, e.g., catting pointer files. And since we have no use cases for the extra lines of a pointer file yet, except possibly to add some human-visible explanation that it is a git-annex pointer file, 32kb seems as reasonable an arbitrary number as anything. Increasing it would be possible, e.g. to 64kb, as long as users of such jumbo pointer files didn't mind upgrading all their git-annex installations to one that supports the new larger size.

Sponsored-by: Dartmouth College's Datalad project
parent
649464619e
commit
67245ae00f
5 changed files with 113 additions and 9 deletions
@@ -295,17 +295,45 @@ unableToRestage mf = unwords
 	, "git update-index -q --refresh " ++ fromMaybe "<file>" mf
 	]
 
-{- Parses a symlink target or a pointer file to a Key. -}
+{- Parses a symlink target or a pointer file to a Key.
+ -
+ - Makes sure that the pointer file is valid, including not being longer
+ - than the maximum allowed size of a valid pointer file, and that any
+ - subsequent lines after the first contain the validPointerLineTag.
+ - If a valid pointer file gets some other data appended to it, it should
+ - never be considered valid, unless that data happened to itself be a
+ - valid pointer file.
+ -}
 parseLinkTargetOrPointer :: S.ByteString -> Maybe Key
-parseLinkTargetOrPointer = go . S8.takeWhile (not . lineend)
+parseLinkTargetOrPointer b
+	| S.length b <= maxValidPointerSz =
+		let (firstline, rest) = S8.span (/= '\n') b
+		in case parsekey $ droptrailing '\r' firstline of
+			Just k | restvalid (dropleading '\n' rest) -> Just k
+			_ -> Nothing
+	| otherwise = Nothing
   where
-	go l
+	parsekey l
 		| isLinkToAnnex l = fileKey $ snd $ S8.breakEnd pathsep l
 		| otherwise = Nothing
 
-	lineend '\n' = True
-	lineend '\r' = True
-	lineend _ = False
+	restvalid r
+		| S.null r = True
+		| otherwise =
+			let (l, r') = S8.span (/= '\n') r
+			in validPointerLineTag `S.isInfixOf` l
+				&& (not (S8.null r') && S8.head r' == '\n')
+				&& restvalid (S8.tail r')
+
+	dropleading c l
+		| S.null l = l
+		| S8.head l == c = S8.tail l
+		| otherwise = l
+
+	droptrailing c l
+		| S.null l = l
+		| S8.last l == c = S8.init l
+		| otherwise = l
 
 	pathsep '/' = True
 #ifdef mingw32_HOST_OS
@@ -332,9 +360,17 @@ formatPointer k = prefix <> keyFile k <> nl
  -
  - 8192 bytes is plenty for a pointer to a key. This adds some additional
  - padding to allow for pointer files that have lines of additional data
- - after the key. -}
+ - after the key.
+ -
+ - One additional byte is used to detect when a valid pointer file
+ - got something else appended to it.
+ -}
 maxPointerSz :: Int
-maxPointerSz = 81920
+maxPointerSz = maxValidPointerSz + 1
 
+{- Maximum size of a valid pointer file is 32kb. -}
+maxValidPointerSz :: Int
+maxValidPointerSz = 32768
+
 maxSymlinkSz :: Int
 maxSymlinkSz = 8192
@@ -387,3 +423,7 @@ isLinkToAnnex s = p `S.isInfixOf` s
 #ifdef mingw32_HOST_OS
 	p' = toInternalGitPath p
 #endif
+
+{- String that must appear on every line of a valid pointer file. -}
+validPointerLineTag :: S.ByteString
+validPointerLineTag = "/annex/"
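The comment on `maxPointerSz` says one additional byte is used to detect appended data. A minimal sketch of that idea follows. It assumes the caller reads at most `maxPointerSz` bytes from the file; the diff does not show how git-annex performs the read, and `readPointerCandidate` is a hypothetical helper that is not part of git-annex.

```haskell
import qualified Data.ByteString as S
import System.IO (IOMode (ReadMode), withBinaryFile)

-- Sizes as defined in the diff above.
maxValidPointerSz :: Int
maxValidPointerSz = 32768

maxPointerSz :: Int
maxPointerSz = maxValidPointerSz + 1

-- Hypothetical helper, not part of git-annex: read at most maxPointerSz
-- bytes, and reject the file when the one extra byte was actually present.
readPointerCandidate :: FilePath -> IO (Maybe S.ByteString)
readPointerCandidate f = withBinaryFile f ReadMode $ \h -> do
    b <- S.hGet h maxPointerSz
    return $
        if S.length b <= maxValidPointerSz
            then Just b
            else Nothing
```

Reading one byte past the limit lets a file of exactly 32768 bytes be told apart from a longer one without loading the longer file into memory.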
@@ -1,3 +1,12 @@
+git-annex (10.20220223) UNRELEASED; urgency=medium
+
+  * Detect when an unlocked file whose content is not present has gotten
+    some other content appended to it, and avoid treating it as a pointer
+    file, so that appended content will not be checked into git, but will
+    be annexed like any other file.
+
+ -- Joey Hess <id@joeyh.name>  Wed, 23 Feb 2022 14:14:09 -0400
+
 git-annex (10.20220222) upstream; urgency=medium
 
   * annex.skipunknown now defaults to false, so commands like
@@ -0,0 +1,30 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 4"""
+ date="2022-02-23T16:45:24Z"
+ content="""
+I've now specified a format in [[internals/pointer_file]], which is
+designed to allow detecting accidental appends.
+
+And git-annex will now treat a pointer file that has been appended to as
+not a pointer file any longer.
+
+So, for example:
+
+	joey@darkstar:/tmp/r>echo oops >> foo
+	joey@darkstar:/tmp/r>cat foo
+	/annex/objects/SHA256E-s14169--bdcf6188db530bc3af79c898208ce2a56df6197f59b3872b03613a248ac8faf4
+	oops
+	joey@darkstar:/tmp/r>git add foo
+	joey@darkstar:/tmp/r>git diff --cached foo | tail -n 2
+	-/annex/objects/SHA256E-s14169--bdcf6188db530bc3af79c898208ce2a56df6197f59b3872b03613a248ac8faf4
+	+/annex/objects/SHA256E-s101--b7da3d6b0ad2f6a2a263e783e59efb60f2520f03bb36cea35a556a684b0d5c9d
+
+Since the file is not a valid pointer file after being appended to,
+git add does what it would do with any file, in this case adding the
+content to the annex.
+
+So at least it keeps the possibly large appended content out of git now.
+I think that's the most important thing. Detecting and warning about
+pointer files that are not valid due to appends should be easy from here.
+"""]]
@@ -7,7 +7,7 @@ some documentation to that end.
 ### `.git/annex/objects/aa/bb/*/*`
 
 This is where locally available file contents are actually stored.
-Files added to the annex get a symlink or pointer file checked into git,
+Files added to the annex get a symlink or [[pointer_file]] checked into git,
 that points to the file content.
 
 First there are two levels of directories used for hashing, to prevent
25
doc/internals/pointer_file.mdwn
Normal file
@@ -0,0 +1,25 @@
+A pointer file is one of two ways that an annex object can be checked into
+git. The other is a symbolic link pointing to a file in the
+.git/annex/objects/ directory.
+
+A pointer file starts with "/annex/objects/", which is followed
+by the key (see [[key_format]]). (In some situations a pointer file
+might instead contain the content of a symlink target.)
+
+Pointer files usually have a newline after the key. This is not required.
+A carriage return followed by a newline is also accepted, as is end of file.
+
+After that, there is usually nothing more in a pointer file, but git-annex
+does support pointer files with additional text on subsequent lines.
+
+Every such subsequent line has to contain "/annex/" somewhere in it,
+and end in a newline. Otherwise it is not considered to be a valid pointer file.
+
+The maximum size of a pointer file is 32 kb. If it is any longer, it is not
+considered to be a valid pointer file.
+
+The possibility exists that a pointer file is in a working tree,
+representing an annex object that is not present, and something appends
+data onto it accidentally. The limitation that each line of a valid
+pointer file contains "/annex/" and that it cannot be larger than 32kb
+let such a situation be detected.
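The rules in pointer_file.mdwn can be sketched as a standalone check that depends only on the bytestring package. This is an illustration under stated assumptions, not git-annex's implementation: `pointerKey` is a hypothetical name, it returns the raw key text instead of parsing it into git-annex's `Key` type, and it only handles the "/annex/objects/" prefix form, not the symlink-target form mentioned in the documentation.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString as S
import qualified Data.ByteString.Char8 as S8

-- Limit and tag as given in the documentation above.
maxValidPointerSz :: Int
maxValidPointerSz = 32768

validPointerLineTag :: S.ByteString
validPointerLineTag = "/annex/"

pointerPrefix :: S.ByteString
pointerPrefix = "/annex/objects/"

-- Hypothetical simplification: yields the raw key text of a valid
-- pointer file, or Nothing when any of the documented rules is broken.
pointerKey :: S.ByteString -> Maybe S.ByteString
pointerKey b
    | S.length b > maxValidPointerSz = Nothing
    | not (pointerPrefix `S.isPrefixOf` firstline) = Nothing
    | S.null key = Nothing
    | restValid (S.drop 1 rest) = Just key
    | otherwise = Nothing
  where
    -- rest is either empty (end of file) or starts with the newline.
    (rawfirst, rest) = S8.span (/= '\n') b
    firstline = dropTrailingCR rawfirst
    key = S.drop (S.length pointerPrefix) firstline

-- Every later line must contain the tag and end in a newline.
restValid :: S.ByteString -> Bool
restValid r
    | S.null r = True
    | otherwise =
        let (l, r') = S8.span (/= '\n') r
        in validPointerLineTag `S.isInfixOf` l
            && not (S.null r')
            && restValid (S.drop 1 r')

-- Accept a carriage return before the newline that ends the first line.
dropTrailingCR :: S.ByteString -> S.ByteString
dropTrailingCR l
    | not (S.null l) && S8.last l == '\r' = S8.init l
    | otherwise = l
```

With these definitions, `pointerKey "/annex/objects/KEY\n"` gives `Just "KEY"`, while `pointerKey "/annex/objects/KEY\noops\n"` gives `Nothing`, because the appended line lacks "/annex/".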