support VURL backend

Recording hashes on download from the web, and verifying those hashes,
are not yet implemented.

addurl --verifiable option added with -V short option because I
expect a lot of people will want to use this.

It seems likely that --verifiable will become the default eventually,
and possibly rather soon. While old git-annex versions don't support
VURL, that doesn't prevent using them with keys that use VURL. Of
course, they won't verify the content on transfer, and fsck will warn
that it doesn't know about VURL. So there's not much problem with
starting to use VURL even when interoperating with old versions.

Sponsored-by: Joshua Antonishen on Patreon
Joey Hess 2024-02-29 13:26:06 -04:00
parent 8f40e0269b
commit 0f7143d226
GPG key ID: DB12DB0FF05F8F38
12 changed files with 127 additions and 33 deletions


@ -506,7 +506,7 @@ gitAnnexWebCertificate r = fromRawFilePath $ gitAnnexDir r P.</> "certificate.pe
gitAnnexWebPrivKey :: Git.Repo -> FilePath
gitAnnexWebPrivKey r = fromRawFilePath $ gitAnnexDir r P.</> "privkey.pem"
{- .git/annex/feeds/ is used to record per-key (url) state by importfeeds -}
{- .git/annex/feeds/ is used to record per-key (url) state by importfeed -}
gitAnnexFeedStateDir :: Git.Repo -> RawFilePath
gitAnnexFeedStateDir r = P.addTrailingPathSeparator $
gitAnnexDir r P.</> "feedstate"


@ -1,6 +1,7 @@
{- git-annex "URL" backend -- keys whose content is available from urls.
{- git-annex "URL" and "VURL" backends -- keys whose content is
- available from urls.
-
- Copyright 2011 Joey Hess <id@joeyh.name>
- Copyright 2011-2024 Joey Hess <id@joeyh.name>
-
- Licensed under the GNU AGPL version 3 or higher.
-}
@ -16,10 +17,10 @@ import Types.Backend
import Backend.Utilities
backends :: [Backend]
backends = [backend]
backends = [backendURL, backendVURL]
backend :: Backend
backend = Backend
backendURL :: Backend
backendURL = Backend
{ backendVariety = URLKey
, genKey = Nothing
, verifyKeyContent = Nothing
@ -32,10 +33,28 @@ backend = Backend
, isCryptographicallySecure = False
}
backendVURL :: Backend
backendVURL = Backend
{ backendVariety = VURLKey
, genKey = Nothing
, verifyKeyContent = Nothing -- TODO
, verifyKeyContentIncrementally = Nothing -- TODO
, canUpgradeKey = Nothing
, fastMigrate = Nothing
-- Even if a hash is recorded on initial download from the web and
-- is used to verify every subsequent transfer including other
-- downloads from the web, in a split-brain situation there
-- can be more than one hash and different versions of the content.
-- So the content is not stable.
, isStableKey = const False
, isCryptographicallySecure = False
-- TODO it is when all recorded hashes are
}
{- Every unique url has a corresponding key. -}
fromUrl :: String -> Maybe Integer -> Key
fromUrl url size = mkKey $ \k -> k
fromUrl :: String -> Maybe Integer -> Bool -> Key
fromUrl url size verifiable = mkKey $ \k -> k
{ keyName = genKeyName url
, keyVariety = URLKey
, keyVariety = if verifiable then VURLKey else URLKey
, keySize = size
}


@ -1,5 +1,9 @@
git-annex (10.20240228) UNRELEASED; urgency=medium
* addurl, importfeed: Added --verifiable option, which improves
the safety of --fast or --relaxed by letting the content of
annexed files be verified with a checksum that is calculated
on a later download from the web.
* Added dependency on unbounded-delays.
-- Joey Hess <id@joeyh.name> Tue, 27 Feb 2024 13:07:10 -0400


@ -1,6 +1,6 @@
{- git-annex command
-
- Copyright 2011-2021 Joey Hess <id@joeyh.name>
- Copyright 2011-2024 Joey Hess <id@joeyh.name>
-
- Licensed under the GNU AGPL version 3 or higher.
-}
@ -62,6 +62,7 @@ data AddUrlOptions = AddUrlOptions
data DownloadOptions = DownloadOptions
{ relaxedOption :: Bool
, verifiableOption :: Bool
, rawOption :: Bool
, noRawOption :: Bool
, rawExceptOption :: Maybe (DeferredParse Remote)
@ -96,7 +97,12 @@ parseDownloadOptions :: Bool -> Parser DownloadOptions
parseDownloadOptions withfileoptions = DownloadOptions
<$> switch
( long "relaxed"
<> help "skip size check"
<> help "accept whatever content is downloaded from web even if it changes"
)
<*> switch
( long "verifiable"
<> short 'V'
<> help "improve later verification of --fast or --relaxed content"
)
<*> switch
( long "raw"
@ -215,7 +221,7 @@ performRemote addunlockedmatcher r o uri file sz = lookupKey file >>= \case
downloadRemoteFile :: AddUnlockedMatcher -> Remote -> DownloadOptions -> URLString -> RawFilePath -> Maybe Integer -> Annex (Maybe Key)
downloadRemoteFile addunlockedmatcher r o uri file sz = checkCanAdd o file $ \canadd -> do
let urlkey = Backend.URL.fromUrl uri sz
let urlkey = Backend.URL.fromUrl uri sz (verifiableOption o)
createWorkTreeDirectory (parentDir file)
ifM (Annex.getRead Annex.fast <||> pure (relaxedOption o))
( do
@ -344,7 +350,7 @@ downloadWeb :: AddUnlockedMatcher -> DownloadOptions -> URLString -> Url.UrlInfo
downloadWeb addunlockedmatcher o url urlinfo file =
go =<< downloadWith' downloader urlkey webUUID url file
where
urlkey = addSizeUrlKey urlinfo $ Backend.URL.fromUrl url Nothing
urlkey = addSizeUrlKey urlinfo $ Backend.URL.fromUrl url Nothing (verifiableOption o)
downloader f p = Url.withUrlOptions $ downloadUrl False urlkey p Nothing [url] f
go Nothing = return Nothing
go (Just (tmp, backend)) = ifM (useYoutubeDl o <&&> liftIO (isHtmlFile (fromRawFilePath tmp)))
@ -388,7 +394,7 @@ downloadWeb addunlockedmatcher o url urlinfo file =
warning (UnquotedString dlcmd <> " did not download anything")
return Nothing
mediaurl = setDownloader url YoutubeDownloader
mediakey = Backend.URL.fromUrl mediaurl Nothing
mediakey = Backend.URL.fromUrl mediaurl Nothing (verifiableOption o)
-- Does the already annexed file have the mediaurl
-- as an url? If so nothing to do.
alreadyannexed dest k = do
@ -436,7 +442,7 @@ startingAddUrl si url o p = starting "addurl" ai si $ do
-- used to prevent two threads running concurrently when that would
-- likely fail.
ai = OnlyActionOn urlkey (ActionItemOther (Just (UnquotedString url)))
urlkey = Backend.URL.fromUrl url Nothing
urlkey = Backend.URL.fromUrl url Nothing (verifiableOption (downloadOptions o))
showDestinationFile :: RawFilePath -> Annex ()
showDestinationFile file = do
@ -539,12 +545,12 @@ nodownloadWeb addunlockedmatcher o url urlinfo file
return Nothing
where
nomedia = do
let key = Backend.URL.fromUrl url (Url.urlSize urlinfo)
let key = Backend.URL.fromUrl url (Url.urlSize urlinfo) (verifiableOption o)
nodownloadWeb' o addunlockedmatcher url key file
usemedia mediafile = do
let dest = youtubeDlDestFile o file mediafile
let mediaurl = setDownloader url YoutubeDownloader
let mediakey = Backend.URL.fromUrl mediaurl Nothing
let mediakey = Backend.URL.fromUrl mediaurl Nothing (verifiableOption o)
nodownloadWeb' o addunlockedmatcher mediaurl mediakey dest
youtubeDlDestFile :: DownloadOptions -> RawFilePath -> RawFilePath -> RawFilePath


@ -94,7 +94,7 @@ keyOpt = either giveup id . keyOpt'
keyOpt' :: String -> Either String Key
keyOpt' s = case parseURIPortable s of
Just u | not (isKeyPrefix (uriScheme u)) ->
Right $ Backend.URL.fromUrl s Nothing
Right $ Backend.URL.fromUrl s Nothing False
_ -> case deserializeKey s of
Just k -> Right k
Nothing -> Left $ "bad key/url " ++ s


@ -283,7 +283,7 @@ startDownload addunlockedmatcher opts cache cv todownload = case location todown
Enclosure url -> startdownloadenclosure url
MediaLink linkurl -> do
let mediaurl = setDownloader linkurl YoutubeDownloader
let mediakey = Backend.URL.fromUrl mediaurl Nothing
let mediakey = Backend.URL.fromUrl mediaurl Nothing (verifiableOption (downloadOptions opts))
-- Old versions of git-annex that used quvi might have
-- used the quviurl for this, so check if it's known
-- to avoid adding it a second time.
@ -638,7 +638,7 @@ clearFeedProblem url =
=<< feedState url
feedState :: URLString -> Annex RawFilePath
feedState url = fromRepo $ gitAnnexFeedState $ fromUrl url Nothing
feedState url = fromRepo $ gitAnnexFeedState $ fromUrl url Nothing False
{- The feed library parses the feed to Text, and does not use the
- filesystem encoding to do it, so when the locale is not unicode


@ -173,7 +173,7 @@ torrentUrlNum u
{- A Key corresponding to the URL of a torrent file. -}
torrentUrlKey :: URLString -> Annex Key
torrentUrlKey u = return $ fromUrl (fst $ torrentUrlNum u) Nothing
torrentUrlKey u = return $ fromUrl (fst $ torrentUrlNum u) Nothing False
{- Temporary filename to use to store the torrent file. -}
tmpTorrentFile :: URLString -> Annex RawFilePath


@ -1,6 +1,6 @@
{- git-annex Key data type
-
- Copyright 2011-2020 Joey Hess <id@joeyh.name>
- Copyright 2011-2024 Joey Hess <id@joeyh.name>
-
- Licensed under the GNU AGPL version 3 or higher.
-}
@ -218,6 +218,7 @@ data KeyVariety
| MD5Key HasExt
| WORMKey
| URLKey
| VURLKey
-- A key that is handled by some external backend.
| ExternalKey S.ByteString HasExt
-- Some repositories may contain keys of other varieties,
@ -251,6 +252,7 @@ hasExt (SHA1Key (HasExt b)) = b
hasExt (MD5Key (HasExt b)) = b
hasExt WORMKey = False
hasExt URLKey = False
hasExt VURLKey = False
hasExt (ExternalKey _ (HasExt b)) = b
hasExt (OtherKey s) = (snd <$> S8.unsnoc s) == Just 'E'
@ -279,6 +281,7 @@ formatKeyVariety v = case v of
MD5Key e -> adde e "MD5"
WORMKey -> "WORM"
URLKey -> "URL"
VURLKey -> "VURL"
ExternalKey s e -> adde e ("X" <> s)
OtherKey s -> s
where
@ -343,6 +346,7 @@ parseKeyVariety "MD5" = MD5Key (HasExt False)
parseKeyVariety "MD5E" = MD5Key (HasExt True)
parseKeyVariety "WORM" = WORMKey
parseKeyVariety "URL" = URLKey
parseKeyVariety "VURL" = VURLKey
parseKeyVariety b
| "X" `S.isPrefixOf` b =
let b' = S.tail b


@ -54,6 +54,10 @@ in `.gitattributes`:
`BLAKE2SP224E`, `BLAKE2SP256E`
-- Fast [Blake2 hash](https://blake2.net/) variants optimised for
8-way CPUs.
`VURL` -- This is like an `URL` (see below) but the content can
be verified with a cryptographically secure checksum that is
recorded in the git-annex branch. It's generated when using
eg `git-annex addurl --fast --verifiable`.
## non-cryptographically secure backends
@ -68,10 +72,11 @@ content of an annexed file remains unchanged.
files or slow systems.
* `URL` -- This is a key that is generated from the url to a file.
It's generated when using eg, `git annex addurl --fast`, when the file
content is not available for hashing. The key may not contain the full
URL; for long URLs, part of the URL may be represented by a checksum.
content is not available for hashing.
The key may not contain the full URL; for long URLs, part of the URL may be
represented by a checksum.
The URL key may contain `&` characters; be sure to quote the key if
passing it to a shell script. The URL-backend key is distinct from URLs/URIs
passing it to a shell script. These types of keys are distinct from URLs/URIs
that may be attached to a key (using any backend) indicating the key's location
on the web or in one of [[special_remotes]].
* `GIT` -- This is used internally by git-annex when exporting trees


@ -42,7 +42,24 @@ be used to get better filenames.
This is the fastest option, but it still has to access the network
to check if the url contains embedded media. When adding large numbers
of urls, using `--relaxed --raw` is much faster.
* `--verifiable` `-V`
This can be used with the `--fast` or `--relaxed` option. It improves
the safety of the resulting annexed file, by letting its content be
verified with a checksum when it is transferred between git-annex
repositories, as well as by things like `git-annex fsck`.
When used with --relaxed, content from the web will always be accepted,
even if it has changed, and the checksum recorded for later verification.
When used with --fast, the checksum is recorded the first time the
content is downloaded from the web. Once a checksum has been recorded,
subsequent downloads from the web must have the same checksum.
Note that this option currently only has an effect when using the
web special remote, not other special remotes that handle urls.
* `--raw`
Prevent special handling of urls by yt-dlp, and by bittorrent


@ -37,7 +37,7 @@ resulting in the new url being downloaded to such a filename.
Force downloading items it's seen before.
* `--relaxed`, `--fast`, `--raw`, `--raw-except`
* `--fast`, `--relaxed`, `--verifiable`, `--raw`, `--raw-except`
These options behave the same as when using [[git-annex-addurl]](1).


@ -24,23 +24,46 @@ the web with its hash recorded, but has since gotten corrupted.
Seems that the only possible way to resolve this problem is to change to a
new type of url key, that is known to always have its hash recorded on
download from the web. (Call this a "dynamic" url key.)
download from the web. (Call this a verifiable url key: a VURL.)
And handle all existing relaxed url keys as before.
That would leave it up to the user to migrate their relaxed url keys to
dynamic urls keys, if desired. Now that distributed migration is
That would leave it up to the user to migrate their URL keys to
VURL keys, if desired. Now that distributed migration is
implemented, that seems sufficiently easy.
## addurl --fast
Using addurl --fast rather than --relaxed records the size but doesn't
hash. So it has the same problem that data corruption can go unnoticed,
only the data corruption has to involve bit flips and not truncation.
So it seems that --fast ought to also be handled. The difference being that
an url added with --fast is expected to always be the same on re-download
from the web, while an url added with --relaxed may change its content on
re-download from the web while being still considered the same object.
This can also use a VURL key, but include the size in it. When downloading
a sized VURL, the web special remote will hash the content, and verify
that either no hash has been recorded before (and record the hash when the
size matches), or that it matches the previously recorded hash.
Note that, if an url is added with --fast and that gets committed and
pulled by another repo, and then later both repos download the content
from the web, it would be possible for the web to serve up different
content to the two, and in that case either hash would be treated as
valid.
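
The check described above for a sized VURL could be sketched roughly as
follows. This is only an illustration, not code from git-annex: the
`Hash` and `Verdict` types and the `verifySizedVURL` function are
hypothetical names, and hashes are plain strings here. It assumes that
more than one hash may already be recorded (the split-brain case), and
that any one of them is acceptable.

```haskell
-- Hypothetical stand-ins for illustration; these are not git-annex's types.
type Hash = String

-- What to do with content that was just downloaded from the web.
data Verdict
    = RecordAndAccept Hash -- no hash was recorded before; record this one
    | Accept               -- content matches a previously recorded hash
    | Reject               -- size mismatch, or hash matches none recorded
    deriving (Eq, Show)

-- Decide the fate of a download for a VURL key added with --fast.
verifySizedVURL
    :: Maybe Integer -- size stored in the key, if any
    -> [Hash]        -- hashes already recorded for the key
    -> Integer       -- size of the downloaded content
    -> Hash          -- hash of the downloaded content
    -> Verdict
verifySizedVURL keysize recorded gotsize gothash
    -- A sized key never accepts content of a different size.
    | maybe False (/= gotsize) keysize = Reject
    -- First download: the size matched, so record the hash.
    | null recorded = RecordAndAccept gothash
    -- Later downloads must match one of the recorded hashes.
    | gothash `elem` recorded = Accept
    | otherwise = Reject
```

Keeping the decision separate from the hashing and the download would
let the same logic be reused by whichever remote performs the download.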
## other special remotes
If the web special remote is what takes care of hashing the content on
download and recording the hash-based key, what about other special remotes
that claim an url?
This could also be implemented in the bittorrent special remote, but what
This could also be implemented in the bittorrent special remote
(though ), but what
about external special remotes?
An alternative would be to add a downloadDynamicUrl that is called instead
An alternative would be to add a downloadVerifiedUrl that is called instead
of retrieveKeyFile and returns a hash-based key (allowing hashing the
download on the fly). Then git-annex would take
care of recording the hash-based key. The external special remote interface
@ -50,10 +73,26 @@ could be extended to include that.
Should annex.backend gitconfig be used to pick which hash-based key to use?
The risk is that config changes and several different hash-based keys get
recorded for a dynamic url. Not really a problem, but would increase the
recorded for a VURL. Not really a problem, but would increase the
size of the git-annex branch unnecessarily, and require extra work when
verifying the key.)
What if annex.backend uses WORM or something that is not hash-based?
Seems it ought to fall back to SHA256 or something then.
To support annex.securehashesonly it would be good if only
cryptographically secure hashes were recorded for a VURL. But of course,
which hashes are considered secure can change. Still, let's start by
only allowing currently secure hashes to be used for VURLs. This way,
when there are multiple hashes recorded for a VURL, they will all be
cryptographically secure, and so the VURL can have
`isCryptographicallySecure = True`. If any of the hashes later becomes
broken, the VURL will no longer be treated as cryptographically secure,
because the broken hash can be used to verify its content.
In that case, the user would probably just migrate to a hash-based key,
although perhaps something VURL-specific could be built to upgrade its
hashes.
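
The rule above, that a VURL is only as secure as the weakest hash
recorded for it, could be sketched like this. The `HashKind` type and
both function names are hypothetical, and the small set of hashes is
only an example. Treating a VURL with no recorded hashes as not secure
is an assumption of this sketch, not something decided above.

```haskell
-- Hypothetical, simplified set of hash types, for illustration only.
data HashKind = SHA256 | SHA512 | SHA1 | MD5
    deriving (Eq, Show)

-- Which hashes are currently considered cryptographically secure.
-- This judgement can change over time, as noted above.
isSecureHash :: HashKind -> Bool
isSecureHash SHA256 = True
isSecureHash SHA512 = True
isSecureHash SHA1 = False
isSecureHash MD5 = False

-- A VURL counts as secure only if every recorded hash is secure,
-- since any one recorded hash can be used to verify its content.
-- Assumption: with no recorded hashes, nothing can be verified,
-- so the key is treated as not secure.
vurlIsCryptographicallySecure :: [HashKind] -> Bool
vurlIsCryptographicallySecure recorded =
    not (null recorded) && all isSecureHash recorded
```

If a hash in the list later becomes broken, changing only
`isSecureHash` would be enough to stop treating affected VURLs as
secure.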
## use for other types of keys
It would also be possible to use these new git-annex branch log files