fully specify the pointer file format

This format is designed to detect accidental appends, while having some
room for future expansion.

Detect when an unlocked file whose content is not present has gotten some
other content appended to it, and avoid treating it as a pointer file, so
that appended content will not be checked into git, but will be annexed
like any other file.

Dropped the max size of a pointer file down to 32kb. It was around 80 kb,
but without any good reason, and certainly there are no valid pointer files
anywhere that are larger than 8kb, because it has only just been specified
what a pointer file with additional data even looks like.

I assume 32kb will be good enough for anyone. ;-) Really though, it needs
to be some smallish number, because that much of a file in git gets read
into memory when eg, catting pointer files. And since we have no use cases
for the extra lines of a pointer file yet, except possibly to add
some human-visible explanation that it is a git-annex pointer file, 32k
seems as reasonable an arbitrary number as anything. Increasing it would be
possible, eg to 64k, as long as users of such jumbo pointer files didn't
mind upgrading all their git-annex installations to one that supports the
new larger size.

Sponsored-by: Dartmouth College's Datalad project
Joey Hess 2022-02-23 14:20:31 -04:00
parent 649464619e
commit 67245ae00f
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
5 changed files with 113 additions and 9 deletions


@ -295,17 +295,45 @@ unableToRestage mf = unwords
 	, "git update-index -q --refresh " ++ fromMaybe "<file>" mf
 	]
 
-{- Parses a symlink target or a pointer file to a Key. -}
+{- Parses a symlink target or a pointer file to a Key.
+ -
+ - Makes sure that the pointer file is valid, including not being longer
+ - than the maximum allowed size of a valid pointer file, and that any
+ - subsequent lines after the first contain the validPointerLineTag.
+ - If a valid pointer file gets some other data appended to it, it should
+ - never be considered valid, unless that data happened to itself be a
+ - valid pointer file.
+ -}
 parseLinkTargetOrPointer :: S.ByteString -> Maybe Key
-parseLinkTargetOrPointer = go . S8.takeWhile (not . lineend)
+parseLinkTargetOrPointer b
+	| S.length b <= maxValidPointerSz =
+		let (firstline, rest) = S8.span (/= '\n') b
+		in case parsekey $ droptrailing '\r' firstline of
+			Just k | restvalid (dropleading '\n' rest) -> Just k
+			_ -> Nothing
+	| otherwise = Nothing
   where
-	go l
+	parsekey l
 		| isLinkToAnnex l = fileKey $ snd $ S8.breakEnd pathsep l
 		| otherwise = Nothing
 
+	restvalid r
+		| S.null r = True
+		| otherwise =
+			let (l, r') = S8.span (/= '\n') r
+			in validPointerLineTag `S.isInfixOf` l
+				&& (not (S8.null r') && S8.head r' == '\n')
+				&& restvalid (S8.tail r')
+
+	dropleading c l
+		| S.null l = l
+		| S8.head l == c = S8.tail l
+		| otherwise = l
+
-	lineend '\n' = True
-	lineend '\r' = True
-	lineend _ = False
+	droptrailing c l
+		| S.null l = l
+		| S8.last l == c = S8.init l
+		| otherwise = l
 
 	pathsep '/' = True
 #ifdef mingw32_HOST_OS
@ -332,9 +360,17 @@ formatPointer k = prefix <> keyFile k <> nl
  -
  - 8192 bytes is plenty for a pointer to a key. This adds some additional
  - padding to allow for pointer files that have lines of additional data
- - after the key. -}
+ - after the key.
+ -
+ - One additional byte is used to detect when a valid pointer file
+ - got something else appended to it.
+ -}
 maxPointerSz :: Int
-maxPointerSz = 81920
+maxPointerSz = maxValidPointerSz + 1
+
+{- Maximum size of a valid pointer file is 32kb. -}
+maxValidPointerSz :: Int
+maxValidPointerSz = 32768
 
 maxSymlinkSz :: Int
 maxSymlinkSz = 8192
@ -387,3 +423,7 @@ isLinkToAnnex s = p `S.isInfixOf` s
 #ifdef mingw32_HOST_OS
 	p' = toInternalGitPath p
 #endif
+
+{- String that must appear on every line of a valid pointer file. -}
+validPointerLineTag :: S.ByteString
+validPointerLineTag = "/annex/"


@ -1,3 +1,12 @@
+git-annex (10.20220223) UNRELEASED; urgency=medium
+
+  * Detect when an unlocked file whose content is not present has gotten
+    some other content appended to it, and avoid treating it as a pointer
+    file, so that appended content will not be checked into git, but will
+    be annexed like any other file.
+
+ -- Joey Hess <id@joeyh.name>  Wed, 23 Feb 2022 14:14:09 -0400
+
 git-annex (10.20220222) upstream; urgency=medium
 
   * annex.skipunknown now defaults to false, so commands like


@ -0,0 +1,30 @@
[[!comment format=mdwn
username="joey"
subject="""comment 4"""
date="2022-02-23T16:45:24Z"
content="""
I've now specified a format in [[internals/pointer_file]], which is
designed to allow detecting accidental appends.

And git-annex will now treat a pointer file that has been appended to as
not a pointer file any longer.

So, for example:

	joey@darkstar:/tmp/r>echo oops >> foo
	joey@darkstar:/tmp/r>cat foo
	/annex/objects/SHA256E-s14169--bdcf6188db530bc3af79c898208ce2a56df6197f59b3872b03613a248ac8faf4
	oops
	joey@darkstar:/tmp/r>git add foo
	joey@darkstar:/tmp/r>git diff --cached foo | tail -n 2
	-/annex/objects/SHA256E-s14169--bdcf6188db530bc3af79c898208ce2a56df6197f59b3872b03613a248ac8faf4
	+/annex/objects/SHA256E-s101--b7da3d6b0ad2f6a2a263e783e59efb60f2520f03bb36cea35a556a684b0d5c9d

Since the file is not a valid pointer file after being appended to,
git add does what it would do with any file, in this case adding the
content to the annex.

So at least it keeps the possibly large appended content out of git now.
I think that's the most important thing. Detecting and warning about
pointer files that are not valid due to appends should be easy from here.
"""]]


@ -7,7 +7,7 @@ some documentation to that end.
 ### `.git/annex/objects/aa/bb/*/*`
 
 This is where locally available file contents are actually stored.
-Files added to the annex get a symlink or pointer file checked into git,
+Files added to the annex get a symlink or [[pointer_file]] checked into git,
 that points to the file content.
 
 First there are two levels of directories used for hashing, to prevent


@ -0,0 +1,25 @@
A pointer file is one of two ways that an annex object can be checked into
git. The other is a symbolic link pointing to a file in the
.git/annex/objects/ directory.

A pointer file starts with "/annex/objects/", which is followed
by the key (see [[key_format]]). (In some situations a pointer file
might instead contain the content of a symlink target.)

Pointer files usually have a newline after the key. This is not required.
A carriage return followed by a newline is also accepted, as is end of file.

After that, there is usually nothing more in a pointer file, but git-annex
does support pointer files with additional text on subsequent lines.
Every such subsequent line has to contain "/annex/" somewhere in it,
and end in a newline. Otherwise it is not considered to be a valid pointer file.

The maximum size of a pointer file is 32 kb. If it is any longer, it is not
considered to be a valid pointer file.

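
A reader does not need to load a whole file to enforce this limit. Reading
at most one byte more than the limit is enough to notice that a file is too
long. The following is a minimal sketch of that idea, not git-annex's actual
code: the two constants mirror the ones in the accompanying source change,
while `readPointerCandidate` and `tooLargeForPointer` are hypothetical names
made up for illustration.

```haskell
import qualified Data.ByteString as S
import System.IO (IOMode(ReadMode), withBinaryFile)

-- Largest size of a valid pointer file (32 kb), as stated above.
maxValidPointerSz :: Int
maxValidPointerSz = 32768

-- One extra byte, so that a file longer than the limit can be told
-- apart from a file that is exactly at the limit.
maxPointerSz :: Int
maxPointerSz = maxValidPointerSz + 1

-- Hypothetical helper: read at most maxPointerSz bytes from the start
-- of a file, so memory use stays bounded however large the file is.
readPointerCandidate :: FilePath -> IO S.ByteString
readPointerCandidate f =
    withBinaryFile f ReadMode $ \h -> S.hGet h maxPointerSz

-- Hypothetical helper: True when the bytes that were read show that the
-- file is longer than a valid pointer file is allowed to be.
tooLargeForPointer :: S.ByteString -> Bool
tooLargeForPointer b = S.length b > maxValidPointerSz
```

With this approach, a file of exactly 32768 bytes is still a candidate,
while anything longer yields 32769 bytes and can be rejected without
reading further.
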
The possibility exists that a pointer file is in a working tree,
representing an annex object that is not present, and something appends
data onto it accidentally. The requirements that each line of a valid
pointer file contains "/annex/" and that the file cannot be larger than
32 kb let such a situation be detected.
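
The rules above can be sketched as a small standalone check. This is an
illustration under simplifying assumptions, not git-annex's implementation:
`pointerKeyText` is a hypothetical name, it handles only the
"/annex/objects/" form (not the symlink target form), and it returns the key
text without parsing it according to [[key_format]].

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as S
import qualified Data.ByteString.Char8 as S8

-- Largest size of a valid pointer file (32 kb), as stated above.
maxValidPointerSz :: Int
maxValidPointerSz = 32768

-- Hypothetical helper: returns the unparsed key text when the content
-- follows the format described above, and Nothing otherwise.
pointerKeyText :: S.ByteString -> Maybe S.ByteString
pointerKeyText b
    | S.length b > maxValidPointerSz = Nothing
    | otherwise = do
        key <- S.stripPrefix "/annex/objects/" (dropTrailingCR firstline)
        -- S.drop 1 skips the newline that ended the first line, if any.
        if not (S.null key) && restValid (S.drop 1 rest)
            then Just key
            else Nothing
  where
    (firstline, rest) = S8.span (/= '\n') b

-- Every line after the first must contain "/annex/" and end in a newline.
restValid :: S.ByteString -> Bool
restValid r
    | S.null r = True
    | otherwise =
        let (l, r') = S8.span (/= '\n') r
        in "/annex/" `S.isInfixOf` l
            && not (S.null r')          -- the line must end in a newline
            && restValid (S.drop 1 r')

-- Accept a carriage return before the newline that ends the first line.
dropTrailingCR :: S.ByteString -> S.ByteString
dropTrailingCR l
    | not (S.null l) && S8.last l == '\r' = S.init l
    | otherwise = l
```

Under these assumptions, a pointer file that has had a line such as `oops`
appended to it is rejected, because that line does not contain "/annex/".
Content that ends without a final newline after an extra line is rejected
as well.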