default to yt-dlp and fix progress parsing bugs
I noticed git-annex was using a lot of CPU when downloading from youtube,
and was not displaying progress. Turns out that yt-dlp (and I think also
youtube-dl) sometimes only knows an estimated size, not the actual size,
and displays the progress output slightly differently for that. That broke
the parser. And, the parser was feeding chunks that failed to parse back
as a remainder, which caused it to try to re-parse the entire output each
time, so it got slower and slower.
Using --progress-template like this should avoid parsing problems as well
as future-proof against output changes. But it works only with yt-dlp.
So, this seemed like the right time to deprecate youtube-dl, and default
to yt-dlp when available.
git-annex will still use youtube-dl if that's all that's available.
However, since the progress parser for youtube-dl was buggy, and I don't
want to maintain two different progress parsers (especially since
youtube-dl is no longer in debian unstable having been replaced by
yt-dlp), made git-annex no longer try to parse youtube-dl's progress.
Also, updated docs for yt-dlp being default. It did not seem worth
renaming annex.youtube-dl-options and annex.youtube-dl-command.
Note that yt-dlp does not seem to document the fields available in the
progress template. I found them by reading the source and looking at
the templates it uses internally. Also note that the use of "i" (rather
than "s") in progressTemplate makes it display floats rounded to integers;
particularly the estimated total size can be a float. That also does not
seem to be documented but I assume is a python thing?
Sponsored-by: Joshua Antonishen on Patreon
2023-05-27 16:45:16 +00:00
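The parsing approach described in this commit message can be sketched roughly as follows. This is a minimal illustration, not git-annex's actual parser: the `ANNEX <downloaded> <estimate> <total> ANNEX` line layout, the "NA" placeholder for unknown sizes, and the names `Progress`, `parseProgressLine` and `parseChunk` are all assumptions made for the example. The point it shows is that a complete line that fails to parse is dropped rather than fed back as a remainder, so only a trailing partial line is ever looked at again.

```haskell
import Text.Read (readMaybe)

-- Hypothetical progress record; the field names are illustrative.
data Progress = Progress
    { downloadedBytes :: Integer
    , totalBytes :: Maybe Integer
    } deriving (Show, Eq)

-- Parse one line of the assumed form
--   ANNEX <downloaded> <estimated total> <total> ANNEX
-- where a size that is not known is assumed to be printed as "NA".
parseProgressLine :: String -> Maybe Progress
parseProgressLine l = case words l of
    ["ANNEX", d, est, tot, "ANNEX"] -> do
        n <- readMaybe d
        -- Prefer the actual total; fall back to the estimate.
        let total = case readMaybe tot of
                Just t -> Just t
                Nothing -> readMaybe est
        Just (Progress n total)
    _ -> Nothing

-- Consume a chunk of output. Complete lines are either parsed or
-- discarded; only the trailing incomplete line is returned as the
-- remainder to be prepended to the next chunk.
parseChunk :: String -> ([Progress], String)
parseChunk = go []
  where
    go acc rest = case break (== '\n') rest of
        (l, '\n':rest') -> go (maybe acc (:acc) (parseProgressLine l)) rest'
        (partial, _) -> (reverse acc, partial)
```

Because the remainder is never longer than one partial line, the work done per chunk stays proportional to the chunk size instead of growing with the total output.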
{- yt-dlp (and deprecated youtube-dl) integration for git-annex
 -
 - Copyright 2017-2024 Joey Hess <id@joeyh.name>
 -
 - Licensed under the GNU AGPL version 3 or higher.
 -}

{-# LANGUAGE DeriveGeneric #-}

module Annex.YoutubeDl (
    youtubeDl,
    youtubeDlTo,
    youtubeDlSupported,
    youtubeDlCheck,
    youtubeDlFileName,
limit url downloads to whitelisted schemes
Security fix! Allowing any schemes, particularly file: and
possibly others like scp: allowed file exfiltration by anyone who had
write access to the git repository, since they could add an annexed file
using such an url, or using an url that redirected to such an url,
and wait for the victim to get it into their repository and send them a copy.
* Added annex.security.allowed-url-schemes setting, which defaults
to only allowing http and https URLs. Note especially that file:/
is no longer enabled by default.
* Removed annex.web-download-command, since its interface does not allow
supporting annex.security.allowed-url-schemes across redirects.
If you used this setting, you may want to instead use annex.web-options
to pass options to curl.
With annex.web-download-command removed, nearly all url accesses in
git-annex are made via Utility.Url via http-client or curl. http-client
only supports http and https, so no problem there.
(Disabling one and not the other is not implemented.)
Used curl --proto to limit the allowed url schemes.
Note that this will cause git annex fsck --from web to mark files using
a disallowed url scheme as not being present in the web. That seems
acceptable; fsck --from web also does that when a web server is not available.
youtube-dl already disabled file: itself (probably for similar
reasons). The scheme check was also added to youtube-dl urls for
completeness, although that check won't catch any redirects it might
follow. But youtube-dl goes off and does its own thing with other
protocols anyway, so that's fine.
Special remotes that support other domain-specific url schemes are not
affected by this change. In the bittorrent remote, aria2c can still
download magnet: links. The download of the .torrent file is
otherwise now limited by annex.security.allowed-url-schemes.
This does not address any external special remotes that might download
an url themselves. Current thinking is all external special remotes will
need to be audited for this problem, although many of them will use
http libraries that only support http and not curl's menagerie.
The related problem of accessing private localhost and LAN urls is not
addressed by this commit.
This commit was sponsored by Brett Eisenberg on Patreon.
2018-06-15 20:52:24 +00:00
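The scheme allowlist described in this commit message can be sketched as below. This is a simplified illustration under stated assumptions, not the code git-annex uses: `urlScheme`, `defaultAllowedSchemes` and `schemeAllowed` are hypothetical names, and the scheme is extracted by hand rather than with a full URI parser. It also only checks the URL it is given; as the message notes, redirects need to be restricted separately (for example with curl --proto).

```haskell
import Data.Char (isAlpha, isAlphaNum, isAscii, toLower)

-- Extract the lowercased scheme of a URL: an ASCII letter followed by
-- ASCII letters, digits, '+', '-' or '.', terminated by ':'.
-- A simplification; real code would use a proper URI parser.
urlScheme :: String -> Maybe String
urlScheme url = case break (== ':') url of
    (s@(c:_), ':':_)
        | isAscii c && isAlpha c && all schemeChar s -> Just (map toLower s)
    _ -> Nothing
  where
    schemeChar ch = isAscii ch && (isAlphaNum ch || ch `elem` "+-.")

-- Hypothetical default, mirroring the documented default of
-- annex.security.allowed-url-schemes (http and https only).
defaultAllowedSchemes :: [String]
defaultAllowedSchemes = ["http", "https"]

-- True only when the URL has a scheme that is on the allowlist.
schemeAllowed :: [String] -> String -> Bool
schemeAllowed allowed url = maybe False (`elem` allowed) (urlScheme url)
```

With the default list, `schemeAllowed defaultAllowedSchemes "file:///etc/passwd"` is False, which is the behavior change the commit calls out.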
    youtubeDlFileNameHtmlOnly,
    youtubeDlCommand,
    youtubePlaylist,
    YoutubePlaylistItem(..),
) where

import Annex.Common
import qualified Annex
import Annex.Content
import Annex.Url
import Utility.DiskFree
import Utility.HtmlDetect
import Utility.Process.Transcript
import Utility.Metered
import Utility.Tmp
import Messages.Progress
import Logs.Transfer

import Network.URI
import Control.Concurrent.Async
import Text.Read
import Data.Either
import qualified Data.Aeson as Aeson
import GHC.Generics
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as B8

-- youtube-dl can follow redirects to anywhere, including potentially
-- localhost or a private address. So, it's only allowed to download
-- content if the user has allowed access to all addresses.
youtubeDlAllowed :: Annex Bool
youtubeDlAllowed = ipAddressesUnlimited

youtubeDlNotAllowedMessage :: String
youtubeDlNotAllowedMessage = unwords
    [ "This url is supported by yt-dlp, but"
    , "yt-dlp could potentially access any address, and the"
    , "configuration of annex.security.allowed-ip-addresses"
    , "does not allow that. Not using yt-dlp (or youtube-dl)."
    ]

-- Runs youtube-dl in a work directory, to download a single media file
-- from the url. Returns the path to the media file in the work directory.
--
-- Displays a progress meter as youtube-dl downloads.
--
-- If no file is downloaded, or the program is not installed,
-- returns Right Nothing.
--
-- youtube-dl can write to multiple files, either temporary files, or
-- multiple videos found at the url, and git-annex needs only one file.
-- So we need to find the destination file, and make sure there is not
-- more than one. With yt-dlp use --print-to-file to make it record the
-- file(s) it downloads. With youtube-dl, the best that can be done is
-- to require that the work directory end up with only 1 file in it.
-- (This can fail, but youtube-dl is deprecated, and they closed my
-- issue requesting something like --print-to-file;
-- <https://github.com/rg3/youtube-dl/issues/14864>)
youtubeDl :: URLString -> FilePath -> MeterUpdate -> Annex (Either String (Maybe FilePath))
youtubeDl url workdir p = ifM ipAddressesUnlimited
    ( withUrlOptions $ youtubeDl' url workdir p
    , return $ Left youtubeDlNotAllowedMessage
    )

youtubeDl' :: URLString -> FilePath -> MeterUpdate -> UrlOptions -> Annex (Either String (Maybe FilePath))
youtubeDl' url workdir p uo
    | supportedScheme uo url = do
        cmd <- youtubeDlCommand
        ifM (liftIO $ inSearchPath cmd)
            ( runcmd cmd >>= \case
                Right True -> downloadedfiles cmd >>= \case
                    (f:[]) -> return (Right (Just f))
                    [] -> return (nofiles cmd)
                    fs -> return (toomanyfiles cmd fs)
                Right False -> workdirfiles >>= \case
                    [] -> return (Right Nothing)
                    _ -> return (Left $ cmd ++ " download is incomplete. Run the command again to resume.")
                Left msg -> return (Left msg)
            , return (Right Nothing)
            )
    | otherwise = return (Right Nothing)
  where
    nofiles cmd = Left $ cmd ++ " did not put any media in its work directory, perhaps it's been configured to store files somewhere else?"
    toomanyfiles cmd fs = Left $ cmd ++ " downloaded multiple media files; git-annex is only able to deal with one per url: " ++ show fs
    downloadedfiles cmd
        | isytdlp cmd = liftIO $
            (nub . lines <$> readFile filelistfile)
                `catchIO` (pure . const [])
        | otherwise = workdirfiles
    workdirfiles = liftIO $ filter (/= filelistfile)
        <$> (filterM (doesFileExist) =<< dirContents workdir)
    filelistfile = workdir </> filelistfilebase
    filelistfilebase = "git-annex-file-list-file"
    isytdlp cmd = cmd == "yt-dlp"
    runcmd cmd = youtubeDlMaxSize workdir >>= \case
        Left msg -> return (Left msg)
        Right maxsize -> do
            opts <- youtubeDlOpts (dlopts cmd ++ maxsize)
            oh <- mkOutputHandlerQuiet
            -- The size is unknown to start. Once youtube-dl
            -- outputs some progress, the meter will be updated
            -- with the size, which is why it's important the
            -- meter is passed into commandMeter'
            let unknownsize = Nothing :: Maybe FileSize
bwlimit
Added annex.bwlimit and remote.name.annex-bwlimit config that works for git
remotes and many but not all special remotes.
This nearly works, at least for a git remote on the same disk. With it set
to 100kb/1s, the meter displays an actual bandwidth of 128 kb/s, with
occasional spikes to 160 kb/s. So it needs to delay just a bit longer...
I'm unsure why.
However, at the beginning a lot of data flows before it determines the
right bandwidth limit. A granularity of less than 1s would probably improve
that.
And, I don't know yet if it makes sense to have it be 100kb/1s rather than
100kb/s. Is there a situation where the user would want a larger
granularity? Does granularity need to be configurable at all? I only used that
format for the config really in order to reuse an existing parser.
This can't be supported for external special remotes, or for ones that
themselves shell out to an external command. (Well, it could, but it
would involve pausing and resuming the child process tree, which seems
very hard to implement and very strange besides.) There could also be some
built-in special remotes that it still doesn't work for, due to them not
having a progress meter whose display blocks the bandwidth-using thread.
But I don't think there are actually any that run a separate thread for
downloads than the thread that displays the progress meter.
Sponsored-by: Graham Spencer on Patreon
2021-09-21 20:58:02 +00:00
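The delay-based limiting this commit message alludes to can be illustrated with a small calculation. This is only a sketch under assumptions, not the implementation in git-annex: `BwLimit` and `delayNeeded` are hypothetical names, and the idea shown is simply to pause long enough that the average rate so far does not exceed the configured amount per time unit.

```haskell
-- Hypothetical limit: at most limitBytes per limitSeconds,
-- e.g. BwLimit 102400 1 for "100kb/1s".
data BwLimit = BwLimit
    { limitBytes :: Integer
    , limitSeconds :: Double
    }

-- Seconds to pause after sending `sent` bytes in `elapsed` seconds so
-- that the average rate stays within the limit. Zero when the transfer
-- is already slow enough.
delayNeeded :: BwLimit -> Integer -> Double -> Double
delayNeeded (BwLimit b s) sent elapsed = max 0 (target - elapsed)
  where
    -- Time the transfer should have taken at exactly the limit.
    target = fromIntegral sent * s / fromIntegral b
```

A coarser time unit means the calculation is only checked at wider intervals, which is one way to read the granularity question raised above.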
            ok <- metered (Just p) unknownsize Nothing $ \meter meterupdate ->
                liftIO $ commandMeter'
                    (if isytdlp cmd then parseYtdlpProgress else parseYoutubeDlProgress)
                    oh (Just meter) meterupdate cmd opts
                    (\pr -> pr { cwd = Just workdir })
            return (Right ok)
    dlopts cmd =
        [ Param url
        -- To make it only download one file when given a
        -- page with a video and a playlist, download only the video.
        , Param "--no-playlist"
        -- And when given a page with only a playlist, download only
        -- the first video on the playlist. (Assumes the video is
        -- somewhat stable, but this is the only way to prevent
        -- it from downloading the whole playlist.)
        , Param "--playlist-items", Param "0"
        ] ++
        if isytdlp cmd
            then
                -- Avoid warnings, which go to
                -- stderr and may mess up
                -- git-annex's display.
                [ Param "--no-warnings"
                , Param "--progress-template"
                , Param progressTemplate
                , Param "--print-to-file"
                , Param "after_move:filepath"
                , Param filelistfilebase
                ]
            else []

-- To honor annex.diskreserve, ask youtube-dl to not download too
-- large a media file. Factors in other downloads that are in progress,
-- and any files in the workdir that it may have partially downloaded
-- before.
youtubeDlMaxSize :: FilePath -> Annex (Either String [CommandParam])
youtubeDlMaxSize workdir = ifM (Annex.getRead Annex.force)
    ( return $ Right []
    , liftIO (getDiskFree workdir) >>= \case
        Just have -> do
            inprogress <- sizeOfDownloadsInProgress (const True)
            partial <- liftIO $ sum
                <$> (mapM (getFileSize . toRawFilePath) =<< dirContents workdir)
            reserve <- annexDiskReserve <$> Annex.getGitConfig
            let maxsize = have - reserve - inprogress + partial
            if maxsize > 0
                then return $ Right
                    [ Param "--max-filesize"
                    , Param (show maxsize)
                    ]
                else return $ Left $
                    needMoreDiskSpace $
                        negate maxsize + 1024
        Nothing -> return $ Right []
    )
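
The size limit computed for --max-filesize can be looked at in isolation. The following is a minimal sketch, assuming all quantities are plain byte counts; `maxDownloadSize` is a hypothetical standalone name used only for this illustration, and the `Left` value stands for the amount of additional space that would be reported as needed.

```haskell
-- All sizes are in bytes. Free space, minus the configured reserve and
-- other downloads in progress, plus whatever was already partially
-- downloaded into the work directory (that space is already used, so
-- resuming does not need it again).
maxDownloadSize :: Integer -> Integer -> Integer -> Integer -> Either Integer Integer
maxDownloadSize have reserve inprogress partial
    | maxsize > 0 = Right maxsize
    -- Not enough room: report the shortfall plus a 1024 byte margin.
    | otherwise = Left (negate maxsize + 1024)
  where
    maxsize = have - reserve - inprogress + partial
```

For example, with 1000 bytes free, a 100 byte reserve, 200 bytes of other downloads in progress and 50 bytes already downloaded, the limit is 1000 - 100 - 200 + 50 = 750 bytes.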

-- Download a media file to a destination.
youtubeDlTo :: Key -> URLString -> FilePath -> MeterUpdate -> Annex Bool
youtubeDlTo key url dest p = do
    res <- withTmpWorkDir key $ \workdir ->
        youtubeDl url (fromRawFilePath workdir) p >>= \case
            Right (Just mediafile) -> do
                liftIO $ moveFile (toRawFilePath mediafile) (toRawFilePath dest)
                return (Just True)
            Right Nothing -> return (Just False)
            Left msg -> do
filter out control characters in warning messages
Converted warning and similar to use StringContainingQuotedPath. Most
warnings are static strings, some do refer to filepaths that need to be
quoted, and others don't need quoting.
Note that, since quote filters out control characters of even
UnquotedString, this makes all warnings safe, even when an attacker
sneaks in a control character in some other way.
When json is being output, no quoting is done, since json gets its own
quoting.
This does, as a side effect, make warning messages in json output not
be indented. The indentation is only needed to offset warning messages
underneath the display of the file they apply to, so that's ok.
Sponsored-by: Brett Eisenberg on Patreon
2023-04-10 18:47:32 +00:00
				warning (UnquotedString msg)
				return Nothing
	return (fromMaybe False res)

-- youtube-dl supports downloading urls that are not html pages,
-- but we don't want to use it for such urls, since they can be downloaded
-- without it. So, this first downloads part of the content and checks
-- if it's a html page; only then is youtube-dl used.
htmlOnly :: URLString -> a -> Annex a -> Annex a
htmlOnly url fallback a = withUrlOptions $ \uo ->
	liftIO (downloadPartial url uo htmlPrefixLength) >>= \case
		Just bs | isHtmlBs bs -> a
		_ -> return fallback

-- Check if youtube-dl supports downloading content from an url.
youtubeDlSupported :: URLString -> Annex Bool
youtubeDlSupported url = either (const False) id
	<$> withUrlOptions (youtubeDlCheck' url)

-- Check if youtube-dl can find media in an url.
--
-- While this does not download anything, it checks youtubeDlAllowed
-- for symmetry with youtubeDl; the check should not succeed if the
-- download won't succeed.
youtubeDlCheck :: URLString -> Annex (Either String Bool)
youtubeDlCheck url = ifM youtubeDlAllowed
	( withUrlOptions $ youtubeDlCheck' url
	, return $ Left youtubeDlNotAllowedMessage
	)

limit url downloads to whitelisted schemes
Security fix! Allowing any schemes, particularly file: and
possibly others like scp: allowed file exfiltration by anyone who had
write access to the git repository, since they could add an annexed file
using such an url, or using an url that redirected to such an url,
and wait for the victim to get it into their repository and send them a copy.
* Added annex.security.allowed-url-schemes setting, which defaults
to only allowing http and https URLs. Note especially that file:/
is no longer enabled by default.
* Removed annex.web-download-command, since its interface does not allow
supporting annex.security.allowed-url-schemes across redirects.
If you used this setting, you may want to instead use annex.web-options
to pass options to curl.
With annex.web-download-command removed, nearly all url accesses in
git-annex are made via Utility.Url via http-client or curl. http-client
only supports http and https, so no problem there.
(Disabling one and not the other is not implemented.)
Used curl --proto to limit the allowed url schemes.
Note that this will cause git annex fsck --from web to mark files using
a disallowed url scheme as not being present in the web. That seems
acceptable; fsck --from web also does that when a web server is not available.
youtube-dl already disabled file: itself (probably for similar
reasons). The scheme check was also added to youtube-dl urls for
completeness, although that check won't catch any redirects it might
follow. But youtube-dl goes off and does its own thing with other
protocols anyway, so that's fine.
Special remotes that support other domain-specific url schemes are not
affected by this change. In the bittorrent remote, aria2c can still
download magnet: links. The download of the .torrent file is
otherwise now limited by annex.security.allowed-url-schemes.
This does not address any external special remotes that might download
an url themselves. Current thinking is all external special remotes will
need to be audited for this problem, although many of them will use
http libraries that only support http and not curl's menagerie.
The related problem of accessing private localhost and LAN urls is not
addressed by this commit.
This commit was sponsored by Brett Eisenberg on Patreon.
2018-06-15 20:52:24 +00:00

youtubeDlCheck' :: URLString -> UrlOptions -> Annex (Either String Bool)
youtubeDlCheck' url uo
	| supportedScheme uo url = catchMsgIO $ htmlOnly url False $ do
		opts <- youtubeDlOpts [ Param url, Param "--simulate" ]
		cmd <- youtubeDlCommand
		liftIO $ snd <$> processTranscript cmd (toCommand opts) Nothing
	| otherwise = return (Right False)

-- Ask youtube-dl for the filename of media in an url.
--
-- (This is not always identical to the filename it uses when downloading.)
youtubeDlFileName :: URLString -> Annex (Either String FilePath)
youtubeDlFileName url = withUrlOptions go
  where
	go uo
		| supportedScheme uo url = flip catchIO (pure . Left . show) $
			htmlOnly url nomedia (youtubeDlFileNameHtmlOnly' url uo)
		| otherwise = return nomedia
	nomedia = Left "no media in url"

-- Does not check whether the url is an html page, as htmlOnly does; use
-- when that's already been verified.
youtubeDlFileNameHtmlOnly :: URLString -> Annex (Either String FilePath)
youtubeDlFileNameHtmlOnly = withUrlOptions . youtubeDlFileNameHtmlOnly'

youtubeDlFileNameHtmlOnly' :: URLString -> UrlOptions -> Annex (Either String FilePath)
youtubeDlFileNameHtmlOnly' url uo
	| supportedScheme uo url = flip catchIO (pure . Left . show) go
	| otherwise = return nomedia
  where
	go = do
		-- Sometimes youtube-dl will fail with an ugly backtrace
		-- (eg, http://bugs.debian.org/874321)
		-- so catch stderr as well as stdout to avoid the user
		-- seeing it. --no-warnings avoids warning messages that
		-- are output to stdout.
		opts <- youtubeDlOpts
			[ Param url
			, Param "--get-filename"
			, Param "--no-warnings"
			, Param "--no-playlist"
			]
		cmd <- youtubeDlCommand
		let p = (proc cmd (toCommand opts))
			{ std_out = CreatePipe
			, std_err = CreatePipe
			}
		liftIO $ withCreateProcess p waitproc

	waitproc Nothing (Just o) (Just e) pid = do
		errt <- async $ discardstderr pid e
		output <- hGetContentsStrict o
		ok <- liftIO $ checkSuccessProcess pid
		wait errt
		return $ case (ok, lines output) of
			(True, (f:_)) | not (null f) -> Right f
			_ -> nomedia
	waitproc _ _ _ _ = error "internal"

	discardstderr pid e = hGetLineUntilExitOrEOF pid e >>= \case
		Nothing -> return ()
		Just _ -> discardstderr pid e

	nomedia = Left "no media in url"

youtubeDlOpts :: [CommandParam] -> Annex [CommandParam]
youtubeDlOpts addopts = do
	opts <- map Param . annexYoutubeDlOptions <$> Annex.getGitConfig
	return (opts ++ addopts)

youtubeDlCommand :: Annex String
youtubeDlCommand = annexYoutubeDlCommand <$> Annex.getGitConfig >>= \case
	Just c -> pure c
	Nothing -> ifM (liftIO $ inSearchPath "yt-dlp")
		( return "yt-dlp"
		, return "youtube-dl"
		)

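{- The selection order in youtubeDlCommand, restated as a pure function.
 - This is only a sketch: chooseCommand is a hypothetical name, its first
 - argument stands in for the annex.youtube-dl-command setting, and its
 - second for whether yt-dlp was found in the search path.
 -
```haskell
chooseCommand :: Maybe String -> Bool -> String
-- An explicitly configured command always wins.
chooseCommand (Just c) _ = c
-- Otherwise prefer yt-dlp when it is installed.
chooseCommand Nothing True = "yt-dlp"
-- Fall back to the deprecated youtube-dl.
chooseCommand Nothing False = "youtube-dl"
```
 -}
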
supportedScheme :: UrlOptions -> URLString -> Bool
supportedScheme uo url = case parseURIRelaxed url of
	Nothing -> False
	Just u -> case uriScheme u of
		-- avoid ugly message from youtube-dl about not supporting file:
		"file:" -> False
		-- ftp indexes may look like html pages, and there's no point
		-- involving youtube-dl in a ftp download
		"ftp:" -> False
		_ -> allowedScheme uo u

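{- A standalone sketch of the decision made by supportedScheme, using only
 - base. Everything here is an assumption made for illustration: schemeOf
 - is a crude stand-in for parseURIRelaxed and uriScheme, and the list of
 - allowed schemes stands in for allowedScheme and the
 - annex.security.allowed-url-schemes setting.
 -
```haskell
-- The scheme of an url, including the trailing colon, if it has one.
schemeOf :: String -> Maybe String
schemeOf url = case break (== ':') url of
	(s, ':':_) | not (null s) -> Just (s ++ ":")
	_ -> Nothing

usableScheme :: [String] -> String -> Bool
usableScheme allowed url = case schemeOf url of
	Nothing -> False
	-- file: and ftp: are rejected even when they are allowed.
	Just "file:" -> False
	Just "ftp:" -> False
	Just s -> s `elem` allowed
```
 -}
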
progressTemplate :: String
progressTemplate = "ANNEX %(progress.downloaded_bytes)i %(progress.total_bytes_estimate)i %(progress.total_bytes)i ANNEX"

{- The progressTemplate makes output look like "ANNEX 10 100 NA ANNEX" or
 - "ANNEX 10 NA 100 ANNEX" depending on whether the total bytes are estimated
 - or known. That makes parsing much easier (and less fragile) than parsing
 - the usual progress output.
 -}
parseYtdlpProgress :: ProgressParser
parseYtdlpProgress = go [] . reverse . progresschunks
  where
	delim = '\r'

	progresschunks = splitc delim

	go remainder [] = (Nothing, Nothing, remainder)
	go remainder (x:xs) = case splitc ' ' x of
		("ANNEX":downloaded_bytes_s:total_bytes_estimate_s:total_bytes_s:"ANNEX":[]) ->
			case (readMaybe downloaded_bytes_s, readMaybe total_bytes_estimate_s, readMaybe total_bytes_s) of
				(Just downloaded_bytes, Nothing, Just total_bytes) ->
					( Just (BytesProcessed downloaded_bytes)
					, Just (TotalSize total_bytes)
					, remainder
					)
				(Just downloaded_bytes, Just total_bytes_estimate, _) ->
					( Just (BytesProcessed downloaded_bytes)
					, Just (TotalSize total_bytes_estimate)
					, remainder
					)
				_ -> go (remainder++x) xs
		_ -> go (remainder++x) xs

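{- A standalone sketch of how one line in the progressTemplate format can
 - be decoded. It mirrors the case analysis in parseYtdlpProgress, but it
 - is not the git-annex code: parseProgressLine is a hypothetical name, it
 - uses words rather than splitc, and it returns a plain pair of
 - (downloaded bytes, total bytes) instead of BytesProcessed and TotalSize.
 -
```haskell
import Text.Read (readMaybe)

parseProgressLine :: String -> Maybe (Integer, Integer)
parseProgressLine l = case words l of
	["ANNEX", d, est, tot, "ANNEX"] ->
		case (readMaybe d, readMaybe est, readMaybe tot) of
			-- No estimate ("NA"), but the total size is known.
			(Just downloaded, Nothing, Just total) ->
				Just (downloaded, total)
			-- An estimate is available; use it as the total.
			(Just downloaded, Just estimate, _) ->
				Just (downloaded, estimate)
			_ -> Nothing
	_ -> Nothing
```
 -
 - Both "ANNEX 10 NA 100 ANNEX" and "ANNEX 10 100 NA ANNEX" decode to
 - Just (10,100); anything else decodes to Nothing.
 -}
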
{- youtube-dl is deprecated. Parsing its progress was attempted before, but
 - was buggy and is no longer done. -}
parseYoutubeDlProgress :: ProgressParser
parseYoutubeDlProgress _ = (Nothing, Nothing, "")

{- List the items that yt-dlp can download from an url.
 -
 - Note that this does not check youtubeDlAllowed because it does not
 - download content.
 -}
youtubePlaylist :: URLString -> Annex (Either String [YoutubePlaylistItem])
youtubePlaylist url = do
	cmd <- youtubeDlCommand
	if cmd == "yt-dlp"
		then liftIO $ youtubePlaylist' url cmd
		else return $ Left $ "Scraping needs yt-dlp, but git-annex has been configured to use " ++ cmd

youtubePlaylist' :: URLString -> String -> IO (Either String [YoutubePlaylistItem])
youtubePlaylist' url cmd = withTmpFile "yt-dlp" $ \tmpfile h -> do
	hClose h
	(outerr, ok) <- processTranscript cmd
		[ "--simulate"
		, "--flat-playlist"
		-- Skip live videos in progress
		, "--match-filter", "!is_live"
		, "--print-to-file"
		-- Write json with selected fields.
		, "%(.{" ++ intercalate "," youtubePlaylistItemFields ++ "})j"
		, tmpfile
		, url
		]
		Nothing
	if ok
		then flip catchIO (pure . Left . show) $ do
			v <- map Aeson.eitherDecodeStrict . B8.lines
				<$> B.readFile tmpfile
			return $ case partitionEithers v of
				((parserr:_), _) ->
					Left $ "yt-dlp json parse error: " ++ parserr
				([], r) -> Right r
		else return $ Left $ if null outerr
			then "yt-dlp failed"
			else "yt-dlp failed: " ++ outerr

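{- For reference, a small sketch of the string that youtubePlaylist' passes
 - after --print-to-file. The names fields and printTemplate are
 - assumptions made for illustration; fields repeats
 - youtubePlaylistItemFields so that the sketch stands alone.
 -
```haskell
import Data.List (intercalate)

fields :: [String]
fields =
	[ "playlist_title"
	, "playlist_uploader"
	, "title"
	, "description"
	, "license"
	, "url"
	, "timestamp"
	]

-- The field names are joined with commas inside "%(.{" and "})j".
printTemplate :: String
printTemplate = "%(.{" ++ intercalate "," fields ++ "})j"
```
 -
 - The code above then reads the temp file line by line and decodes each
 - line as one JSON value.
 -}
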
-- There are other fields that yt-dlp can extract, but these are similar to
-- the information from an RSS feed.
youtubePlaylistItemFields :: [String]
youtubePlaylistItemFields =
	[ "playlist_title"
	, "playlist_uploader"
	, "title"
	, "description"
	, "license"
	, "url"
	, "timestamp"
	]

-- Parse JSON generated by yt-dlp for playlist. Note that any field
-- may be omitted when that information is not supported for a given website.
data YoutubePlaylistItem = YoutubePlaylistItem
	{ youtube_playlist_title :: Maybe String
	, youtube_playlist_uploader :: Maybe String
	, youtube_title :: Maybe String
	, youtube_description :: Maybe String
	, youtube_license :: Maybe String
	, youtube_url :: Maybe String
	, youtube_timestamp :: Maybe Integer -- ^ unix timestamp
	} deriving (Generic, Show)

instance Aeson.FromJSON YoutubePlaylistItem
  where
	parseJSON = Aeson.genericParseJSON Aeson.defaultOptions
		{ Aeson.fieldLabelModifier = drop (length "youtube_") }

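{- A sketch of what the fieldLabelModifier above does to the record field
 - names, using only base. The names jsonKey and recordFields are
 - assumptions made for illustration; recordFields repeats the field names
 - of YoutubePlaylistItem as strings.
 -
```haskell
-- The same function the FromJSON instance uses as its modifier:
-- it drops the "youtube_" prefix from a record field name.
jsonKey :: String -> String
jsonKey = drop (length "youtube_")

recordFields :: [String]
recordFields =
	[ "youtube_playlist_title"
	, "youtube_playlist_uploader"
	, "youtube_title"
	, "youtube_description"
	, "youtube_license"
	, "youtube_url"
	, "youtube_timestamp"
	]

-- The JSON keys the instance looks for, in record order.
jsonKeys :: [String]
jsonKeys = map jsonKey recordFields
```
 -
 - The resulting keys are the same names that are listed in
 - youtubePlaylistItemFields, which is why the record can be decoded
 - from the JSON that yt-dlp is asked to write.
 -}
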