default to yt-dlp and fix progress parsing bugs

I noticed git-annex was using a lot of CPU when downloading from YouTube,
and was not displaying progress. It turns out that yt-dlp (and I think also
youtube-dl) sometimes knows only an estimated size, not the actual size,
and formats its progress output slightly differently in that case. That
broke the parser. On top of that, the parser fed chunks that failed to
parse back as a remainder, so it tried to re-parse the entire accumulated
output each time and got slower and slower.
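
One way to keep the remainder from growing is to carry over only the text
after the last carriage return, and to drop complete chunks that fail to
parse. The following is a minimal sketch of that idea, not the code in this
commit; splitComplete and feed are hypothetical names.

```haskell
import Data.Maybe (mapMaybe)

-- Split buffered output at carriage returns. Everything before the last
-- '\r' is a list of complete chunks; whatever follows it may still be
-- incomplete and is the only text carried over to the next call.
splitComplete :: String -> ([String], String)
splitComplete buf = case break (== '\r') buf of
    (incomplete, []) -> ([], incomplete)
    (chunk, _ : rest) ->
        let (chunks, remainder) = splitComplete rest
        in (chunk : chunks, remainder)

-- Feed newly read output to a chunk parser. Complete chunks that fail to
-- parse are discarded rather than re-queued, so the remainder never grows
-- beyond one incomplete chunk.
feed :: (String -> Maybe a) -> String -> String -> ([a], String)
feed parse remainder new =
    let (chunks, remainder') = splitComplete (remainder ++ new)
    in (mapMaybe parse chunks, remainder')
```

With this shape, each call does work proportional to the new output plus
one partial chunk, rather than to everything seen so far.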

Using --progress-template like this should avoid parsing problems and
future-proof against output changes. But it only works with yt-dlp.
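
As a rough illustration of why the template is easier to parse, here is a
standalone sketch that reads one line in the "ANNEX downloaded estimate
total ANNEX" shape used by this commit. The Progress type and
parseTemplateLine are hypothetical; git-annex's real ProgressParser has a
different type. Unlike the committed parser, this sketch prefers the known
total when both totals are present.

```haskell
import Text.Read (readMaybe)

-- Hypothetical standalone result type for this sketch.
data Progress = Progress
    { downloadedBytes :: Integer
    , totalBytes :: Integer
    , totalIsEstimate :: Bool
    } deriving (Show, Eq)

-- Parse one line such as "ANNEX 10 NA 100 ANNEX" (total known) or
-- "ANNEX 10 100 NA ANNEX" (total only estimated). A field printed as
-- "NA" fails to read as an Integer and is treated as missing.
parseTemplateLine :: String -> Maybe Progress
parseTemplateLine l = case words l of
    ["ANNEX", d, e, t, "ANNEX"] -> do
        downloaded <- readMaybe d
        case (readMaybe t, readMaybe e) of
            (Just total, _) -> Just (Progress downloaded total False)
            (Nothing, Just est) -> Just (Progress downloaded est True)
            (Nothing, Nothing) -> Nothing
    _ -> Nothing
```

Anything that does not match the five-word shape, including the usual
"[download]" lines, is rejected outright instead of being partially parsed.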

So, this seemed like the right time to deprecate youtube-dl, and default
to yt-dlp when available.

git-annex will still use youtube-dl if that's all that's available.
However, since the progress parser for youtube-dl was buggy, and I don't
want to maintain two different progress parsers (especially since
youtube-dl is no longer in Debian unstable, having been replaced by
yt-dlp), git-annex no longer tries to parse youtube-dl's progress.
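
The selection logic described above can be sketched roughly as follows.
This is an illustration only: chooseDownloader and usesYtdlp are
hypothetical names, and findExecutable from the directory package stands in
for git-annex's own searchPath.

```haskell
import Data.List (isInfixOf)
import Data.Maybe (fromMaybe)
import System.Directory (findExecutable)

-- Hypothetical helper: an explicitly configured command always wins;
-- otherwise prefer yt-dlp when it is on the PATH, else fall back to
-- the deprecated youtube-dl.
chooseDownloader :: Maybe String -> IO String
chooseDownloader (Just configured) = pure configured
chooseDownloader Nothing = fromMaybe "youtube-dl" <$> findExecutable "yt-dlp"

-- Progress is only parsed when the chosen command looks like yt-dlp.
usesYtdlp :: String -> Bool
usesYtdlp cmd = "yt-dlp" `isInfixOf` cmd
```

Checking the command name rather than probing the program keeps the
decision cheap, at the cost of misclassifying a yt-dlp binary installed
under some other name.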

Also, updated docs for yt-dlp being default. It did not seem worth
renaming annex.youtube-dl-options and annex.youtube-dl-command.

Note that yt-dlp does not seem to document the fields available in the
progress template. I found them by reading the source and looking at
the templates it uses internally. Also note that the use of "i" (rather
than "s") in progressTemplate makes it display floats rounded to integers;
in particular, the estimated total size can be a float. That also does not
seem to be documented, but I assume it is a Python thing?

Sponsored-by: Joshua Antonishen on Patreon
Joey Hess 2023-05-27 12:45:16 -04:00
parent f1cdb79ca4
commit f2db6da938
7 changed files with 79 additions and 73 deletions


@ -1,6 +1,6 @@
{- youtube-dl integration for git-annex {- yt-dlp (and deprecated youtube-dl) integration for git-annex
- -
- Copyright 2017-2021 Joey Hess <id@joeyh.name> - Copyright 2017-2023 Joey Hess <id@joeyh.name>
- -
- Licensed under the GNU AGPL version 3 or higher. - Licensed under the GNU AGPL version 3 or higher.
-} -}
@ -22,13 +22,11 @@ import Utility.DiskFree
import Utility.HtmlDetect import Utility.HtmlDetect
import Utility.Process.Transcript import Utility.Process.Transcript
import Utility.Metered import Utility.Metered
import Utility.DataUnits
import Messages.Progress import Messages.Progress
import Logs.Transfer import Logs.Transfer
import Network.URI import Network.URI
import Control.Concurrent.Async import Control.Concurrent.Async
import Data.Char
import Text.Read import Text.Read
-- youtube-dl can follow redirects to anywhere, including potentially -- youtube-dl can follow redirects to anywhere, including potentially
@ -39,10 +37,10 @@ youtubeDlAllowed = ipAddressesUnlimited
youtubeDlNotAllowedMessage :: String youtubeDlNotAllowedMessage :: String
youtubeDlNotAllowedMessage = unwords youtubeDlNotAllowedMessage = unwords
[ "This url is supported by youtube-dl, but" [ "This url is supported by yt-dlp, but"
, "youtube-dl could potentially access any address, and the" , "yt-dlp could potentially access any address, and the"
, "configuration of annex.security.allowed-ip-addresses" , "configuration of annex.security.allowed-ip-addresses"
, "does not allow that. Not using youtube-dl." , "does not allow that. Not using yt-dlp (or youtube-dl)."
] ]
-- Runs youtube-dl in a work directory, to download a single media file -- Runs youtube-dl in a work directory, to download a single media file
@ -76,20 +74,21 @@ youtubeDl' url workdir p uo
fs -> return (toomanyfiles fs) fs -> return (toomanyfiles fs)
Right False -> workdirfiles >>= \case Right False -> workdirfiles >>= \case
[] -> return (Right Nothing) [] -> return (Right Nothing)
_ -> return (Left "youtube-dl download is incomplete. Run the command again to resume.") _ -> return (Left "yt-dlp download is incomplete. Run the command again to resume.")
Left msg -> return (Left msg) Left msg -> return (Left msg)
, return (Right Nothing) , return (Right Nothing)
) )
| otherwise = return (Right Nothing) | otherwise = return (Right Nothing)
where where
nofiles = Left "youtube-dl did not put any media in its work directory, perhaps it's been configured to store files somewhere else?" nofiles = Left "yt-dlp did not put any media in its work directory, perhaps it's been configured to store files somewhere else?"
toomanyfiles fs = Left $ "youtube-dl downloaded multiple media files; git-annex is only able to deal with one per url: " ++ show fs toomanyfiles fs = Left $ "yt-dlp downloaded multiple media files; git-annex is only able to deal with one per url: " ++ show fs
workdirfiles = liftIO $ filterM (doesFileExist) =<< dirContents workdir workdirfiles = liftIO $ filterM (doesFileExist) =<< dirContents workdir
runcmd = youtubeDlMaxSize workdir >>= \case runcmd = youtubeDlMaxSize workdir >>= \case
Left msg -> return (Left msg) Left msg -> return (Left msg)
Right maxsize -> do Right maxsize -> do
cmd <- youtubeDlCommand cmd <- youtubeDlCommand
opts <- youtubeDlOpts (dlopts ++ maxsize) let isytdlp = "yt-dlp" `isInfixOf` cmd
opts <- youtubeDlOpts (dlopts isytdlp ++ maxsize)
oh <- mkOutputHandlerQuiet oh <- mkOutputHandlerQuiet
-- The size is unknown to start. Once youtube-dl -- The size is unknown to start. Once youtube-dl
-- outputs some progress, the meter will be updated -- outputs some progress, the meter will be updated
@ -97,21 +96,25 @@ youtubeDl' url workdir p uo
-- meter is passed into commandMeter' -- meter is passed into commandMeter'
let unknownsize = Nothing :: Maybe FileSize let unknownsize = Nothing :: Maybe FileSize
ok <- metered (Just p) unknownsize Nothing $ \meter meterupdate -> ok <- metered (Just p) unknownsize Nothing $ \meter meterupdate ->
liftIO $ commandMeter' liftIO $ commandMeter'
parseYoutubeDlProgress oh (Just meter) meterupdate cmd opts (if isytdlp then parseYtdlpProgress else parseYoutubeDlProgress)
oh (Just meter) meterupdate cmd opts
(\pr -> pr { cwd = Just workdir }) (\pr -> pr { cwd = Just workdir })
return (Right ok) return (Right ok)
dlopts = dlopts isytdlp =
[ Param url [ Param url
-- To make youtube-dl only download one file when given a -- To make it only download one file when given a
-- page with a video and a playlist, download only the video. -- page with a video and a playlist, download only the video.
, Param "--no-playlist" , Param "--no-playlist"
-- And when given a page with only a playlist, download only -- And when given a page with only a playlist, download only
-- the first video on the playlist. (Assumes the video is -- the first video on the playlist. (Assumes the video is
-- somewhat stable, but this is the only way to prevent -- somewhat stable, but this is the only way to prevent
-- youtube-dl from downloading the whole playlist.) -- it from downloading the whole playlist.)
, Param "--playlist-items", Param "0" , Param "--playlist-items", Param "0"
] ] ++
if isytdlp
then [Param "--progress-template", Param progressTemplate]
else []
-- To honor annex.diskreserve, ask youtube-dl to not download too -- To honor annex.diskreserve, ask youtube-dl to not download too
-- large a media file. Factors in other downloads that are in progress, -- large a media file. Factors in other downloads that are in progress,
@ -251,7 +254,7 @@ youtubeDlOpts addopts = do
youtubeDlCommand :: Annex String youtubeDlCommand :: Annex String
youtubeDlCommand = annexYoutubeDlCommand <$> Annex.getGitConfig >>= \case youtubeDlCommand = annexYoutubeDlCommand <$> Annex.getGitConfig >>= \case
Just c -> pure c Just c -> pure c
Nothing -> fromMaybe "yt-dlp" <$> liftIO (searchPath "youtube-dl") Nothing -> fromMaybe "youtube-dl" <$> liftIO (searchPath "yt-dlp")
supportedScheme :: UrlOptions -> URLString -> Bool supportedScheme :: UrlOptions -> URLString -> Bool
supportedScheme uo url = case parseURIRelaxed url of supportedScheme uo url = case parseURIRelaxed url of
@ -264,41 +267,39 @@ supportedScheme uo url = case parseURIRelaxed url of
"ftp:" -> False "ftp:" -> False
_ -> allowedScheme uo u _ -> allowedScheme uo u
{- Strategy: Look for chunks prefixed with \r, which look approximately progressTemplate :: String
- like this for youtube-dl: progressTemplate = "ANNEX %(progress.downloaded_bytes)i %(progress.total_bytes_estimate)i %(progress.total_bytes)i ANNEX"
- "ESC[K[download] 26.6% of 60.22MiB at 254.69MiB/s ETA 00:00"
- or for yt-dlp, like this: {- The progressTemplate makes output look like "ANNEX 10 100 NA ANNEX" or
- "\r[download] 1.8% of 1.14GiB at 1.04MiB/s ETA 18:23" - "ANNEX 10 NA 100 ANNEX" depending on whether the total bytes are estimated
- Look at the number before "% of " and the number and unit after, - or known. That makes parsing much easier (and less fragile) than parsing
- to determine the number of bytes. - the usual progress output.
-} -}
parseYoutubeDlProgress :: ProgressParser parseYtdlpProgress :: ProgressParser
parseYoutubeDlProgress = go [] . reverse . progresschunks parseYtdlpProgress = go [] . reverse . progresschunks
where where
delim = '\r' delim = '\r'
progresschunks = drop 1 . splitc delim progresschunks = splitc delim
go remainder [] = (Nothing, Nothing, remainder) go remainder [] = (Nothing, Nothing, remainder)
go remainder (x:xs) = case split "% of " x of go remainder (x:xs) = case splitc ' ' x of
(p:r:[]) -> case (parsepercent p, parsebytes r) of ("ANNEX":downloaded_bytes_s:total_bytes_estimate_s:total_bytes_s:"ANNEX":[]) ->
(Just percent, Just total) -> case (readMaybe downloaded_bytes_s, readMaybe total_bytes_estimate_s, readMaybe total_bytes_s) of
( Just (toBytesProcessed (calc percent total)) (Just downloaded_bytes, Nothing, Just total_bytes) ->
, Just (TotalSize total) ( Just (BytesProcessed downloaded_bytes)
, remainder , Just (TotalSize total_bytes)
) , remainder
_ -> go (delim:x++remainder) xs )
_ -> go (delim:x++remainder) xs (Just downloaded_bytes, Just total_bytes_estimate, _) ->
( Just (BytesProcessed downloaded_bytes)
, Just (TotalSize total_bytes_estimate)
, remainder
)
_ -> go (remainder++x) xs
_ -> go (remainder++x) xs
calc :: Double -> Integer -> Integer {- youtube-dl is deprecated, parsing its progress was attempted before but
calc percent total = round (percent * fromIntegral total / 100) - was buggy and is no longer done. -}
parseYoutubeDlProgress :: ProgressParser
parsepercent :: String -> Maybe Double parseYoutubeDlProgress _ = (Nothing, Nothing, "")
parsepercent = readMaybe
. reverse . takeWhile (not . isSpace) . reverse
. dropWhile isSpace
parsebytes = readSize units . takeWhile (not . isSpace)
. dropWhile isSpace
units = committeeUnits ++ storageUnits


@ -67,6 +67,13 @@ git-annex (10.20230408) UNRELEASED; urgency=medium
* sync: Added -g as a short option for --no-content. * sync: Added -g as a short option for --no-content.
* Fix bug in -z handling of trailing NUL in input. * Fix bug in -z handling of trailing NUL in input.
* version: Avoid error message when entire output is not read. * version: Avoid error message when entire output is not read.
* Fix excessive CPU usage when parsing yt-dlp (or youtube-dl) progress
output fails.
* Use --progress-template with yt-dlp to fix a failure to parse
progress output when only an estimated total size is known.
* When yt-dlp is available, default to using it in preference to
youtube-dl. Using youtube-dl is now deprecated, and git-annex no longer
tries to parse its output to display download progress
-- Joey Hess <id@joeyh.name> Sat, 08 Apr 2023 13:57:18 -0400 -- Joey Hess <id@joeyh.name> Sat, 08 Apr 2023 13:57:18 -0400


@ -10,7 +10,7 @@ git annex addurl `[url ...]`
Downloads each url to its own file, which is added to the annex. Downloads each url to its own file, which is added to the annex.
When `youtube-dl` is installed, it can be used to check for a video When `yt-dlp` is installed, it can be used to check for a video
embedded in a web page at the url, and that is added to the annex instead. embedded in a web page at the url, and that is added to the annex instead.
(However, this is disabled by default as it can be a security risk. (However, this is disabled by default as it can be a security risk.
See the documentation of annex.security.allowed-ip-addresses See the documentation of annex.security.allowed-ip-addresses
@ -45,13 +45,13 @@ be used to get better filenames.
* `--raw` * `--raw`
Prevent special handling of urls by youtube-dl, bittorrent, and other Prevent special handling of urls by yt-dlp, bittorrent, and other
special remotes. This will for example, make addurl special remotes. This will for example, make addurl
download the .torrent file and not the contents it points to. download the .torrent file and not the contents it points to.
* `--no-raw` * `--no-raw`
Require content pointed to by the url to be downloaded using youtube-dl Require content pointed to by the url to be downloaded using yt-dlp
or a special remote, rather than the raw content of the url. if that or a special remote, rather than the raw content of the url. if that
cannot be done, the add will fail. cannot be done, the add will fail.


@ -13,7 +13,7 @@ content has not already been added to the repository before, so you can
delete, rename, etc the resulting files and repeated runs won't duplicate delete, rename, etc the resulting files and repeated runs won't duplicate
them. them.
When `youtube-dl` is installed, it can be used to download links in the feed. When `yt-dlp` is installed, it can be used to download links in the feed.
This allows importing e.g., YouTube playlists. This allows importing e.g., YouTube playlists.
(However, this is disabled by default as it can be a security risk. (However, this is disabled by default as it can be a security risk.
See the documentation of annex.security.allowed-ip-addresses See the documentation of annex.security.allowed-ip-addresses
@ -54,13 +54,13 @@ resulting in the new url being downloaded to such a filename.
* `--raw` * `--raw`
Prevent special handling of urls by youtube-dl, bittorrent, and other Prevent special handling of urls by yt-dlp, bittorrent, and other
special remotes. This will for example, make importfeed special remotes. This will for example, make importfeed
download a .torrent file and not the contents it points to. download a .torrent file and not the contents it points to.
* `--no-raw` * `--no-raw`
Require content pointed to by the url to be downloaded using youtube-dl Require content pointed to by the url to be downloaded using yt-dlp
or a special remote, rather than the raw content of the url. if that or a special remote, rather than the raw content of the url. if that
cannot be done, the import will fail, and the next import of the feed cannot be done, the import will fail, and the next import of the feed
will retry. will retry.


@ -1774,19 +1774,19 @@ Remotes are configured using these settings in `.git/config`.
* `annex.youtube-dl-options` * `annex.youtube-dl-options`
Options to pass to youtube-dl (or yt-dlp) when using it to find the url Options to pass to yt-dlp (or deprecated youtube-dl) when using it to
to download for a video. find the url to download for a video.
Some options may break git-annex's integration with youtube-dl. For Some options may break git-annex's integration with yt-dlp. For
example, the --output option could cause it to store files somewhere example, the --output option could cause it to store files somewhere
git-annex won't find them. Avoid setting here or in the youtube-dl config git-annex won't find them. Avoid setting here or in the yt-dlp config
file any options that cause youtube-dl to download more than one file, file any options that cause it to download more than one file,
or to store the file anywhere other than the current working directory. or to store the file anywhere other than the current working directory.
* `annex.youtube-dl-command` * `annex.youtube-dl-command`
Command to run for youtube-dl. Default is to use "youtube-dl" or Default is to use "yt-dlp" or if that is not available in the PATH,
if that is not available in the PATH, to use "yt-dlp". to use "youtube-dl".
* `annex.aria-torrent-options` * `annex.aria-torrent-options`
@ -1837,8 +1837,8 @@ Remotes are configured using these settings in `.git/config`.
causing it to be downloaded into your repository and transferred to causing it to be downloaded into your repository and transferred to
other remotes, exposing its content. other remotes, exposing its content.
Note that, since the interfaces of curl and youtube-dl do not allow Note that, since the interfaces of curl and yt-dlp do not allow
these IP address restrictions to be enforced, curl and youtube-dl will these IP address restrictions to be enforced, curl and yt-dlp will
never be used unless annex.security.allowed-ip-addresses=all. never be used unless annex.security.allowed-ip-addresses=all.
To allow accessing local or private IP addresses on only specific ports, To allow accessing local or private IP addresses on only specific ports,


@ -74,7 +74,7 @@ and transferring to your laptop on demand.
## youtube channels ## youtube channels
You can also use `git annex importfeed` on youtube channels. You can also use `git annex importfeed` on youtube channels.
It will use youtube-dl to automatically It will use yt-dlp to automatically
download the videos. download the videos.
To download a youtube channel, you need to find the feed associated with that To download a youtube channel, you need to find the feed associated with that
@ -84,7 +84,7 @@ manually. For a channel url like
"https://www.youtube.com/channel/$foo", the "https://www.youtube.com/channel/$foo", the
feed is "https://www.youtube.com/feeds/videos.xml?channel_id=$foo" feed is "https://www.youtube.com/feeds/videos.xml?channel_id=$foo"
Use of youtube-dl is disabled by default as it can be a security risk. Use of yt-dlp is disabled by default as it can be a security risk.
See the documentation of annex.security.allowed-ip-addresses See the documentation of annex.security.allowed-ip-addresses
in [[git-annex]] for details.) in [[git-annex]] for details.)


@ -75,9 +75,9 @@ number takes that many paths from the end.
<a name=videos></a> <a name=videos></a>
There's support for downloading videos from sites like YouTube, Vimeo, There's support for downloading videos from sites like YouTube, Vimeo,
and many more. This relies on youtube-dl to download the videos. and many more. This relies on yt-dlp to download the videos.
When you have youtube-dl installed, you can just When you have yt-dlp installed, you can just
`git annex addurl http://youtube.com/foo` and it will detect that `git annex addurl http://youtube.com/foo` and it will detect that
it is a video and download the video content for offline viewing. it is a video and download the video content for offline viewing.
@ -86,16 +86,14 @@ See the documentation of annex.security.allowed-ip-addresses
in [[git-annex]] for details.) in [[git-annex]] for details.)
Later, in another clone of the repository, you can run `git annex get` on Later, in another clone of the repository, you can run `git annex get` on
the file and it will also be downloaded with youtube-dl. This works the file and it will also be downloaded with yt-dlp. This works
even if the video host has transcoded or otherwise changed the video even if the video host has transcoded or otherwise changed the video
in the meantime; the assumption is that these video files are equivalent. in the meantime; the assumption is that these video files are equivalent.
There is an `annex.youtube-dl-options` configuration setting that can be used There is an `annex.youtube-dl-options` configuration setting that can be used
to pass parameters to quvi. For example, you could set `git config to pass parameters to yt-dlp. For example, you could set `git config
annex.youtube-dl-options "--format worst"` to configure it to download low annex.youtube-dl-options "--format worst"` to configure it to download low
quality videos from YouTube. Note that the youtube-dl configuration files quality videos from YouTube.
are not read when git-annex runs youtube-dl, to avoid config settings that
break its integration.
To download a youtube channel, you need to find the RSS feed associated with To download a youtube channel, you need to find the RSS feed associated with
that channel, and pass it to `git annex importfeed`. There does not seem to that channel, and pass it to `git annex importfeed`. There does not seem to