support VURL backend
Not yet implemented is recording hashes on download from web and verifying hashes.

addurl --verifiable option added with -V short option because I expect a lot of people will want to use this.

It seems likely that --verifiable will become the default eventually, and possibly rather soon. While old git-annex versions don't support VURL, that doesn't prevent using them with keys that use VURL. Of course, they won't verify the content on transfer, and fsck will warn that it doesn't know about VURL. So there's not much problem with starting to use VURL even when interoperating with old versions.

Sponsored-by: Joshua Antonishen on Patreon
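The interoperability argument above can be illustrated with a small model of key variety parsing. This is only a sketch: the types are simplified stand-ins for the ones in `Types/Key.hs`, and `parseOld`, `parseNew`, and `hasKnownBackend` are hypothetical names invented for the illustration. The assumption is that a version without VURL support falls through to the catch-all `OtherKey` case, so the key remains usable but has no backend to verify it.

```haskell
-- Simplified, hypothetical model of key variety parsing. The real
-- git-annex type has many more varieties and uses ByteString.
data KeyVariety
  = URLKey
  | VURLKey
  | OtherKey String -- a variety this version does not know about
  deriving (Eq, Show)

-- How a version without VURL support might parse a variety name.
parseOld :: String -> KeyVariety
parseOld "URL" = URLKey
parseOld other = OtherKey other

-- How a version with VURL support parses the same name.
parseNew :: String -> KeyVariety
parseNew "URL" = URLKey
parseNew "VURL" = VURLKey
parseNew other = OtherKey other

-- Assumption for illustration: a variety has a known backend exactly
-- when it did not fall through to OtherKey. Keys without a known
-- backend can still be stored and transferred, just not verified.
hasKnownBackend :: KeyVariety -> Bool
hasKnownBackend (OtherKey _) = False
hasKnownBackend _ = True
```

In this model `parseOld "VURL"` still yields a key variety rather than an error, which matches the observation that old versions can carry VURL keys while skipping verification.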
parent 8f40e0269b
commit 0f7143d226
12 changed files with 127 additions and 33 deletions
|
@ -506,7 +506,7 @@ gitAnnexWebCertificate r = fromRawFilePath $ gitAnnexDir r P.</> "certificate.pe
|
|||
gitAnnexWebPrivKey :: Git.Repo -> FilePath
|
||||
gitAnnexWebPrivKey r = fromRawFilePath $ gitAnnexDir r P.</> "privkey.pem"
|
||||
|
||||
{- .git/annex/feeds/ is used to record per-key (url) state by importfeeds -}
|
||||
{- .git/annex/feeds/ is used to record per-key (url) state by importfeed -}
|
||||
gitAnnexFeedStateDir :: Git.Repo -> RawFilePath
|
||||
gitAnnexFeedStateDir r = P.addTrailingPathSeparator $
|
||||
gitAnnexDir r P.</> "feedstate"
|
||||
|
|
|
@ -1,6 +1,7 @@
|
|||
{- git-annex "URL" backend -- keys whose content is available from urls.
|
||||
{- git-annex "URL" and "VURL" backends -- keys whose content is
|
||||
- available from urls.
|
||||
-
|
||||
- Copyright 2011 Joey Hess <id@joeyh.name>
|
||||
- Copyright 2011-2024 Joey Hess <id@joeyh.name>
|
||||
-
|
||||
- Licensed under the GNU AGPL version 3 or higher.
|
||||
-}
|
||||
|
@ -16,10 +17,10 @@ import Types.Backend
|
|||
import Backend.Utilities
|
||||
|
||||
backends :: [Backend]
|
||||
backends = [backend]
|
||||
backends = [backendURL, backendVURL]
|
||||
|
||||
backend :: Backend
|
||||
backend = Backend
|
||||
backendURL :: Backend
|
||||
backendURL = Backend
|
||||
{ backendVariety = URLKey
|
||||
, genKey = Nothing
|
||||
, verifyKeyContent = Nothing
|
||||
|
@ -32,10 +33,28 @@ backend = Backend
|
|||
, isCryptographicallySecure = False
|
||||
}
|
||||
|
||||
backendVURL :: Backend
|
||||
backendVURL = Backend
|
||||
{ backendVariety = VURLKey
|
||||
, genKey = Nothing
|
||||
, verifyKeyContent = Nothing -- TODO
|
||||
, verifyKeyContentIncrementally = Nothing -- TODO
|
||||
, canUpgradeKey = Nothing
|
||||
, fastMigrate = Nothing
|
||||
-- Even if a hash is recorded on initial download from the web and
|
||||
-- is used to verify every subsequent transfer including other
|
||||
-- downloads from the web, in a split-brain situation there
|
||||
-- can be more than one hash and different versions of the content.
|
||||
-- So the content is not stable.
|
||||
, isStableKey = const False
|
||||
, isCryptographicallySecure = False
|
||||
-- TODO it is when all recorded hashes are
|
||||
}
|
||||
|
||||
{- Every unique url has a corresponding key. -}
|
||||
fromUrl :: String -> Maybe Integer -> Key
|
||||
fromUrl url size = mkKey $ \k -> k
|
||||
fromUrl :: String -> Maybe Integer -> Bool -> Key
|
||||
fromUrl url size verifiable = mkKey $ \k -> k
|
||||
{ keyName = genKeyName url
|
||||
, keyVariety = URLKey
|
||||
, keyVariety = if verifiable then VURLKey else URLKey
|
||||
, keySize = size
|
||||
}
|
||||
|
|
|
@ -1,5 +1,9 @@
|
|||
git-annex (10.20240228) UNRELEASED; urgency=medium
|
||||
|
||||
* addurl, importfeed: Added --verifiable option, which improves
|
||||
the safety of --fast or --relaxed by letting the content of
|
||||
annexed files be verified with a checksum that is calculated
|
||||
on a later download from the web.
|
||||
* Added dependency on unbounded-delays.
|
||||
|
||||
-- Joey Hess <id@joeyh.name> Tue, 27 Feb 2024 13:07:10 -0400
|
||||
|
|
|
@ -1,6 +1,6 @@
|
|||
{- git-annex command
|
||||
-
|
||||
- Copyright 2011-2021 Joey Hess <id@joeyh.name>
|
||||
- Copyright 2011-2024 Joey Hess <id@joeyh.name>
|
||||
-
|
||||
- Licensed under the GNU AGPL version 3 or higher.
|
||||
-}
|
||||
|
@ -62,6 +62,7 @@ data AddUrlOptions = AddUrlOptions
|
|||
|
||||
data DownloadOptions = DownloadOptions
|
||||
{ relaxedOption :: Bool
|
||||
, verifiableOption :: Bool
|
||||
, rawOption :: Bool
|
||||
, noRawOption :: Bool
|
||||
, rawExceptOption :: Maybe (DeferredParse Remote)
|
||||
|
@ -96,7 +97,12 @@ parseDownloadOptions :: Bool -> Parser DownloadOptions
|
|||
parseDownloadOptions withfileoptions = DownloadOptions
|
||||
<$> switch
|
||||
( long "relaxed"
|
||||
<> help "skip size check"
|
||||
<> help "accept whatever content is downloaded from web even if it changes"
|
||||
)
|
||||
<*> switch
|
||||
( long "verifiable"
|
||||
<> short 'V'
|
||||
<> help "improve later verification of --fast or --relaxed content"
|
||||
)
|
||||
<*> switch
|
||||
( long "raw"
|
||||
|
@ -215,7 +221,7 @@ performRemote addunlockedmatcher r o uri file sz = lookupKey file >>= \case
|
|||
|
||||
downloadRemoteFile :: AddUnlockedMatcher -> Remote -> DownloadOptions -> URLString -> RawFilePath -> Maybe Integer -> Annex (Maybe Key)
|
||||
downloadRemoteFile addunlockedmatcher r o uri file sz = checkCanAdd o file $ \canadd -> do
|
||||
let urlkey = Backend.URL.fromUrl uri sz
|
||||
let urlkey = Backend.URL.fromUrl uri sz (verifiableOption o)
|
||||
createWorkTreeDirectory (parentDir file)
|
||||
ifM (Annex.getRead Annex.fast <||> pure (relaxedOption o))
|
||||
( do
|
||||
|
@ -344,7 +350,7 @@ downloadWeb :: AddUnlockedMatcher -> DownloadOptions -> URLString -> Url.UrlInfo
|
|||
downloadWeb addunlockedmatcher o url urlinfo file =
|
||||
go =<< downloadWith' downloader urlkey webUUID url file
|
||||
where
|
||||
urlkey = addSizeUrlKey urlinfo $ Backend.URL.fromUrl url Nothing
|
||||
urlkey = addSizeUrlKey urlinfo $ Backend.URL.fromUrl url Nothing (verifiableOption o)
|
||||
downloader f p = Url.withUrlOptions $ downloadUrl False urlkey p Nothing [url] f
|
||||
go Nothing = return Nothing
|
||||
go (Just (tmp, backend)) = ifM (useYoutubeDl o <&&> liftIO (isHtmlFile (fromRawFilePath tmp)))
|
||||
|
@ -388,7 +394,7 @@ downloadWeb addunlockedmatcher o url urlinfo file =
|
|||
warning (UnquotedString dlcmd <> " did not download anything")
|
||||
return Nothing
|
||||
mediaurl = setDownloader url YoutubeDownloader
|
||||
mediakey = Backend.URL.fromUrl mediaurl Nothing
|
||||
mediakey = Backend.URL.fromUrl mediaurl Nothing (verifiableOption o)
|
||||
-- Does the already annexed file have the mediaurl
|
||||
-- as an url? If so nothing to do.
|
||||
alreadyannexed dest k = do
|
||||
|
@ -436,7 +442,7 @@ startingAddUrl si url o p = starting "addurl" ai si $ do
|
|||
-- used to prevent two threads running concurrently when that would
|
||||
-- likely fail.
|
||||
ai = OnlyActionOn urlkey (ActionItemOther (Just (UnquotedString url)))
|
||||
urlkey = Backend.URL.fromUrl url Nothing
|
||||
urlkey = Backend.URL.fromUrl url Nothing (verifiableOption (downloadOptions o))
|
||||
|
||||
showDestinationFile :: RawFilePath -> Annex ()
|
||||
showDestinationFile file = do
|
||||
|
@ -539,12 +545,12 @@ nodownloadWeb addunlockedmatcher o url urlinfo file
|
|||
return Nothing
|
||||
where
|
||||
nomedia = do
|
||||
let key = Backend.URL.fromUrl url (Url.urlSize urlinfo)
|
||||
let key = Backend.URL.fromUrl url (Url.urlSize urlinfo) (verifiableOption o)
|
||||
nodownloadWeb' o addunlockedmatcher url key file
|
||||
usemedia mediafile = do
|
||||
let dest = youtubeDlDestFile o file mediafile
|
||||
let mediaurl = setDownloader url YoutubeDownloader
|
||||
let mediakey = Backend.URL.fromUrl mediaurl Nothing
|
||||
let mediakey = Backend.URL.fromUrl mediaurl Nothing (verifiableOption o)
|
||||
nodownloadWeb' o addunlockedmatcher mediaurl mediakey dest
|
||||
|
||||
youtubeDlDestFile :: DownloadOptions -> RawFilePath -> RawFilePath -> RawFilePath
|
||||
|
|
|
@ -94,7 +94,7 @@ keyOpt = either giveup id . keyOpt'
|
|||
keyOpt' :: String -> Either String Key
|
||||
keyOpt' s = case parseURIPortable s of
|
||||
Just u | not (isKeyPrefix (uriScheme u)) ->
|
||||
Right $ Backend.URL.fromUrl s Nothing
|
||||
Right $ Backend.URL.fromUrl s Nothing False
|
||||
_ -> case deserializeKey s of
|
||||
Just k -> Right k
|
||||
Nothing -> Left $ "bad key/url " ++ s
|
||||
|
|
|
@ -283,7 +283,7 @@ startDownload addunlockedmatcher opts cache cv todownload = case location todown
|
|||
Enclosure url -> startdownloadenclosure url
|
||||
MediaLink linkurl -> do
|
||||
let mediaurl = setDownloader linkurl YoutubeDownloader
|
||||
let mediakey = Backend.URL.fromUrl mediaurl Nothing
|
||||
let mediakey = Backend.URL.fromUrl mediaurl Nothing (verifiableOption (downloadOptions opts))
|
||||
-- Old versions of git-annex that used quvi might have
|
||||
-- used the quviurl for this, so check if it's known
|
||||
-- to avoid adding it a second time.
|
||||
|
@ -638,7 +638,7 @@ clearFeedProblem url =
|
|||
=<< feedState url
|
||||
|
||||
feedState :: URLString -> Annex RawFilePath
|
||||
feedState url = fromRepo $ gitAnnexFeedState $ fromUrl url Nothing
|
||||
feedState url = fromRepo $ gitAnnexFeedState $ fromUrl url Nothing False
|
||||
|
||||
{- The feed library parses the feed to Text, and does not use the
|
||||
- filesystem encoding to do it, so when the locale is not unicode
|
||||
|
|
|
@ -173,7 +173,7 @@ torrentUrlNum u
|
|||
|
||||
{- A Key corresponding to the URL of a torrent file. -}
|
||||
torrentUrlKey :: URLString -> Annex Key
|
||||
torrentUrlKey u = return $ fromUrl (fst $ torrentUrlNum u) Nothing
|
||||
torrentUrlKey u = return $ fromUrl (fst $ torrentUrlNum u) Nothing False
|
||||
|
||||
{- Temporary filename to use to store the torrent file. -}
|
||||
tmpTorrentFile :: URLString -> Annex RawFilePath
|
||||
|
|
|
@ -1,6 +1,6 @@
|
|||
{- git-annex Key data type
|
||||
-
|
||||
- Copyright 2011-2020 Joey Hess <id@joeyh.name>
|
||||
- Copyright 2011-2024 Joey Hess <id@joeyh.name>
|
||||
-
|
||||
- Licensed under the GNU AGPL version 3 or higher.
|
||||
-}
|
||||
|
@ -218,6 +218,7 @@ data KeyVariety
|
|||
| MD5Key HasExt
|
||||
| WORMKey
|
||||
| URLKey
|
||||
| VURLKey
|
||||
-- A key that is handled by some external backend.
|
||||
| ExternalKey S.ByteString HasExt
|
||||
-- Some repositories may contain keys of other varieties,
|
||||
|
@ -251,6 +252,7 @@ hasExt (SHA1Key (HasExt b)) = b
|
|||
hasExt (MD5Key (HasExt b)) = b
|
||||
hasExt WORMKey = False
|
||||
hasExt URLKey = False
|
||||
hasExt VURLKey = False
|
||||
hasExt (ExternalKey _ (HasExt b)) = b
|
||||
hasExt (OtherKey s) = (snd <$> S8.unsnoc s) == Just 'E'
|
||||
|
||||
|
@ -279,6 +281,7 @@ formatKeyVariety v = case v of
|
|||
MD5Key e -> adde e "MD5"
|
||||
WORMKey -> "WORM"
|
||||
URLKey -> "URL"
|
||||
VURLKey -> "VURL"
|
||||
ExternalKey s e -> adde e ("X" <> s)
|
||||
OtherKey s -> s
|
||||
where
|
||||
|
@ -343,6 +346,7 @@ parseKeyVariety "MD5" = MD5Key (HasExt False)
|
|||
parseKeyVariety "MD5E" = MD5Key (HasExt True)
|
||||
parseKeyVariety "WORM" = WORMKey
|
||||
parseKeyVariety "URL" = URLKey
|
||||
parseKeyVariety "VURL" = VURLKey
|
||||
parseKeyVariety b
|
||||
| "X" `S.isPrefixOf` b =
|
||||
let b' = S.tail b
|
||||
|
|
|
@ -54,6 +54,10 @@ in `.gitattributes`:
|
|||
`BLAKE2SP224E`, `BLAKE2SP256E`
|
||||
-- Fast [Blake2 hash](https://blake2.net/) variants optimised for
|
||||
8-way CPUs.
|
||||
`VURL` -- This is like an `URL` (see below) but the content can
|
||||
be verified with a cryptographically secure checksum that is
|
||||
recorded in the git-annex branch. It's generated when using
|
||||
eg `git-annex addurl --fast --verifiable`.
|
||||
|
||||
## non-cryptographically secure backends
|
||||
|
||||
|
@ -68,10 +72,11 @@ content of an annexed file remains unchanged.
|
|||
files or slow systems.
|
||||
* `URL` -- This is a key that is generated from the url to a file.
|
||||
It's generated when using eg, `git annex addurl --fast`, when the file
|
||||
content is not available for hashing. The key may not contain the full
|
||||
URL; for long URLs, part of the URL may be represented by a checksum.
|
||||
content is not available for hashing.
|
||||
The key may not contain the full URL; for long URLs, part of the URL may be
|
||||
represented by a checksum.
|
||||
The URL key may contain `&` characters; be sure to quote the key if
|
||||
passing it to a shell script. The URL-backend key is distinct from URLs/URIs
|
||||
passing it to a shell script. These types of keys are distinct from URLs/URIs
|
||||
that may be attached to a key (using any backend) indicating the key's location
|
||||
on the web or in one of [[special_remotes]].
|
||||
* `GIT` -- This is used internally by git-annex when exporting trees
|
||||
|
|
|
@ -42,7 +42,24 @@ be used to get better filenames.
|
|||
This is the fastest option, but it still has to access the network
|
||||
to check if the url contains embedded media. When adding large numbers
|
||||
of urls, using `--relaxed --raw` is much faster.
|
||||
|
||||
|
||||
* `--verifiable` `-V`
|
||||
|
||||
This can be used with the `--fast` or `--relaxed` option. It improves
|
||||
the safety of the resulting annexed file, by letting its content be
|
||||
verified with a checksum when it is transferred between git-annex
|
||||
repositories, as well as by things like `git-annex fsck`.
|
||||
|
||||
When used with --relaxed, content from the web will always be accepted,
|
||||
even if it has changed, and the checksum is recorded for later verification.
|
||||
|
||||
When used with --fast, the checksum is recorded the first time the
|
||||
content is downloaded from the web. Once a checksum has been recorded,
|
||||
subsequent downloads from the web must have the same checksum.
|
||||
|
||||
Note that this option currently only has an effect when using the
|
||||
web special remote, not other special remotes that handle urls.
|
||||
|
||||
* `--raw`
|
||||
|
||||
Prevent special handling of urls by yt-dlp, and by bittorrent
|
||||
|
|
|
@ -37,7 +37,7 @@ resulting in the new url being downloaded to such a filename.
|
|||
|
||||
Force downloading items it's seen before.
|
||||
|
||||
* `--relaxed`, `--fast`, `--raw`, `--raw-except`
|
||||
* `--fast`, `--relaxed`, `--verifiable`, `--raw`, `--raw-except`
|
||||
|
||||
These options behave the same as when using [[git-annex-addurl]](1).
|
||||
|
||||
|
|
|
@ -24,23 +24,46 @@ the web with its hash recorded, but has since gotten corrupted.
|
|||
|
||||
Seems that the only possible way to resolve this problem is to change to a
|
||||
new type of url key, that is known to always have its hash recorded on
|
||||
download from the web. (Call this a "dynamic" url key.)
|
||||
download from the web. (Call this a verifiable url key: a VURL.)
|
||||
And handle all existing relaxed url keys as before.
|
||||
|
||||
That would leave it up to the user to migrate their relaxed url keys to
|
||||
dynamic urls keys, if desired. Now that distributed migration is
|
||||
That would leave it up to the user to migrate their URL keys to
|
||||
VURL keys, if desired. Now that distributed migration is
|
||||
implemented, that seems sufficiently easy.
|
||||
|
||||
## addurl --fast
|
||||
|
||||
Using addurl --fast rather than --relaxed records the size but doesn't
|
||||
hash. So it has the same problem that data corruption can go unnoticed,
|
||||
only the data corruption has to involve bit flips and not truncation.
|
||||
|
||||
So it seems that --fast ought to also be handled. The difference being that
|
||||
an url added with --fast is expected to always be the same on re-download
|
||||
from the web, while an url added with --relaxed may change its content on
|
||||
re-download from the web while being still considered the same object.
|
||||
|
||||
This can also use a VURL key, but include the size in it. When downloading
|
||||
a sized VURL, the web special remote will hash the content, and verify
|
||||
that either no hash has been recorded before (and record the hash when the
|
||||
size matches), or that it matches the previously recorded hash.
|
||||
|
||||
Note that, if an url is added with --fast and that gets committed and
|
||||
pulled by another repo, and then later both repos download the content
|
||||
from the web, it would be possible for the web to serve up different
|
||||
content to the two, and in that case either hash would be treated as
|
||||
valid.
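The check described in this section for a sized VURL can be sketched as follows. This is a minimal model under stated assumptions, not the web special remote's implementation (which this commit does not yet include): `Hash`, `Outcome`, and `checkDownload` are hypothetical names, hashes are plain strings, and the recorded hashes are a plain list. The list may hold more than one entry after the split-brain situation described above, in which case any of them is accepted.

```haskell
-- Hypothetical types for illustration; not git-annex's API.
type Hash = String

data Outcome
  = RecordHash Hash -- no hash known yet and the size matched
  | Verified        -- matches a previously recorded hash
  | Rejected        -- wrong size, or differs from every recorded hash
  deriving (Eq, Show)

-- Decide what to do with a freshly downloaded sized VURL.
-- Arguments: size stored in the key, size of the download,
-- hashes recorded so far, hash of the download.
checkDownload :: Integer -> Integer -> [Hash] -> Hash -> Outcome
checkDownload expectedSize actualSize recorded new
  | new `elem` recorded = Verified
  | not (null recorded) = Rejected
  | expectedSize == actualSize = RecordHash new
  | otherwise = Rejected
```

For example, `checkDownload 10 10 [] "h1"` records `"h1"`, while a later download hashing to `"h2"` against `["h1"]` is rejected. If two repositories each recorded a different hash before merging, the list would contain both and either download would verify.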
|
||||
|
||||
## other special remotes
|
||||
|
||||
If the web special remote is what takes care of hashing the content on
|
||||
download and recording the hash-based key, what about other special remotes
|
||||
that claim an url?
|
||||
|
||||
This could also be implemented in the bittorrent special remote, but what
|
||||
This could also be implemented in the bittorrent special remote
|
||||
(though ), but what
|
||||
about external special remotes?
|
||||
|
||||
An alternative would be to add a downloadDynamicUrl that is called instead
|
||||
An alternative would be to add a downloadVerifiedUrl that is called instead
|
||||
of retrieveKeyFile and returns a hash-based key (allowing hashing the
|
||||
download on the fly). Then git-annex would take
|
||||
care of recording the hash-based key. The external special remote interface
|
||||
|
@ -50,10 +73,26 @@ could be extended to include that.
|
|||
|
||||
Should annex.backend gitconfig be used to pick which hash-based key to use?
|
||||
The risk is that config changes and several different hash-based keys get
|
||||
recorded for a dynamic url. Not really a problem, but would increase the
|
||||
recorded for a VURL. Not really a problem, but would increase the
|
||||
size of the git-annex branch unnecessarily, and require extra work when
|
||||
verifying the key.)
|
||||
|
||||
What if annex.backend uses WORM or something that is not hash-based?
|
||||
Seems it ought to fall back to SHA256 or something then.
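The proposed fallback could look something like the sketch below. It is only an illustration of the idea: the `Backend` type, its constructors, `isHashBased`, and `hashForVURL` are hypothetical and cover just a handful of backends, and the configured value of annex.backend is modelled as a `Maybe`.

```haskell
-- Hypothetical, heavily reduced set of backends for illustration.
data Backend = SHA256 | SHA512 | BLAKE2B256 | WORM | URL
  deriving (Eq, Show)

-- WORM and URL keys are not derived from a hash of the content.
isHashBased :: Backend -> Bool
isHashBased b = b `notElem` [WORM, URL]

-- Pick the hash to record for a VURL: use the configured backend
-- when it is hash-based, otherwise fall back to SHA256.
hashForVURL :: Maybe Backend -> Backend
hashForVURL (Just b) | isHashBased b = b
hashForVURL _ = SHA256
```

With this shape, `hashForVURL (Just WORM)` and `hashForVURL Nothing` both give `SHA256`, while `hashForVURL (Just SHA512)` keeps the configured hash.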
|
||||
|
||||
To support annex.securehashesonly it would be good if only
|
||||
cryptographically secure hashes were recorded for a VURL. But of course,
|
||||
which hashes are considered secure can change. Still, let's start by
|
||||
only allowing currently secure hashes to be used for VURLs. This way,
|
||||
when there are multiple hashes recorded for a VURL, they will all be
|
||||
cryptographically secure, and so the VURL can have
|
||||
`isCryptographicallySecure = True`. If any of the hashes later becomes
|
||||
broken, the VURL will no longer be treated as cryptographically secure,
|
||||
because the broken hash can be used to verify its content.
|
||||
In that case, the user would probably just migrate to a hash-based key,
|
||||
although perhaps something VURL-specific could be built to upgrade its
|
||||
hashes.
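The rule in this paragraph amounts to an "all recorded hashes are secure" check, which can be sketched as below. The names `HashKind`, `currentlySecure`, and `vurlIsSecure` are hypothetical, and the list of secure hashes is an assumption that would change over time. Treating a VURL with no recorded hash as not secure is also an assumption of this sketch, on the grounds that there is nothing to verify its content against.

```haskell
-- Hypothetical, reduced set of hash kinds for illustration.
data HashKind = SHA256 | SHA512 | SHA1 | MD5
  deriving (Eq, Show)

-- Assumption: the hashes considered secure at the moment.
-- If one is later broken, it would be removed from this list.
currentlySecure :: HashKind -> Bool
currentlySecure h = h `elem` [SHA256, SHA512]

-- A VURL is treated as cryptographically secure only when it has
-- at least one recorded hash and every recorded hash is secure.
vurlIsSecure :: [HashKind] -> Bool
vurlIsSecure recorded = not (null recorded) && all currentlySecure recorded
```

A single broken hash in the list is enough to make the result `False`, because that hash alone could be used to pass verification with altered content.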
|
||||
|
||||
## use for other types of keys
|
||||
|
||||
It would also be possible to use these new git-annex branch log files
|
||||
|
|