You can use git-annex as a podcatcher, to download podcast contents.
No additional software is required, but your git-annex must be built
with the Feeds feature (run `git annex version` to check).

All you need to do is put something like this in a cron job:

`cd somerepo && git annex importfeed http://url/to/podcast http://other/podcast/url`

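For example, a crontab entry that checks the feeds once a day might look like this (the repository path is illustrative):

`0 4 * * * cd /home/me/podcasts && git annex importfeed http://url/to/podcast`
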
This downloads the urls, and parses them as RSS, Atom, or RDF feeds.
All enclosures are downloaded and added to the repository, the same as if you
had manually run `git annex addurl` on each of them.

git-annex will avoid downloading a file from a feed if its url has already
been stored in the repository before. So once a file is downloaded,
you can move it around, delete it, `git annex drop` its content, etc,
and it will not be downloaded again by repeated runs of
`git annex importfeed`. Just how a podcatcher should behave. (git-annex versions
since 2015 also track the podcast `guid` values, as metadata, to help avoid
duplication if the media file url changes; use `git annex metadata ...` to inspect.)

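For instance, to inspect the metadata of one downloaded item (the filename here is illustrative):

`git annex metadata somefeed/someitem.mp3`
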
## templates

To control the filenames used for items downloaded from a feed,
there's a `--template` option. The default is
`--template='${feedtitle}/${itemtitle}${extension}'`

Other available template variables:
feedauthor, itemauthor, itemsummary, itemdescription, itemrights, itemid,
itempubdate, author, title.

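For example, to include the publication date in filenames (the feed url is a placeholder):

`git annex importfeed --template='${feedtitle}/${itempubdate}-${itemtitle}${extension}' http://url/to/podcast`
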
## catching up

To catch up on a feed without downloading its contents,
use `git annex importfeed --relaxed`, and delete the symlinks it creates.
Next time you run `git annex importfeed` it will only fetch any new items.

## fast mode

To add a feed without downloading its contents right now,
use `git annex importfeed --fast`. Then you can use `git annex get` as
usual to download the content of an item.

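A typical session might look like this (the filename depends on your template):

    git annex importfeed --fast http://url/to/podcast
    git annex get somefeed/someitem.mp3
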
## storing the podcast list in git

You can check the list of podcast urls into git right next to the
files it downloads. Just make a file named `feeds` and add one podcast url
per line.

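For example, a `feeds` file might contain:

    http://url/to/podcast
    http://other/podcast/url
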
Then you can run git-annex on all the feeds:

`xargs git-annex importfeed < feeds`

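To keep the list in git, commit it as usual (the commit message is illustrative):

    git add feeds
    git commit -m 'add podcast list'
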
## recreating lost episodes

If for some reason git-annex refuses to download files you are certain are
in the podcast, it is quite possible that they have already been downloaded.
In any case, you can use `--force` to redownload them:

`git-annex importfeed --force http://example.com/feed`

## distributed podcatching

A nice benefit of using git-annex as a podcatcher is that you can
run `git annex importfeed` on the same url in different clones
of a repository, and `git annex sync` will sync it all up.

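For example, each clone can run the same import and then sync (assuming remotes are already configured):

    git annex importfeed http://url/to/podcast
    git annex sync
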
## centralized podcatching

You can also have a designated machine which always fetches all podcasts
to local disk and stores them. That way, you can archive podcasts with
time-delayed deletion of upstream content. You can also work around slow
upstream downloads by podcatching to a server with ample bandwidth, or work
around a slow local Internet connection by podcatching to your home server
and transferring to your laptop on demand.

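A minimal sketch of the laptop side, assuming the archiving machine is set up as a remote named "server":

    git annex sync server
    git annex get somefeed/someitem.mp3
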
## youtube channels

You can also use `git annex importfeed` on youtube channels.
It will use yt-dlp to automatically download the videos.

To download a youtube channel, you need to find the feed associated with that
channel, and pass it to `git annex importfeed`. There does not seem to be
an easy link anywhere to get the feed, but you can construct its url
manually. For a channel url like
"https://www.youtube.com/channel/$foo", the
feed is "https://www.youtube.com/feeds/videos.xml?channel_id=$foo".

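For example (keeping $foo as a placeholder for the channel id):

`git annex importfeed 'https://www.youtube.com/feeds/videos.xml?channel_id=$foo'`
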
Use of yt-dlp is disabled by default as it can be a security risk.
See the documentation of `annex.security.allowed-ip-addresses`
in [[git-annex]] for details.

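If you decide to accept the risk, enabling it looks something like this (check [[git-annex]] for the exact values before relying on this):

`git config annex.security.allowed-ip-addresses all`
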
## metadata

As well as storing the urls for items imported from a feed, git-annex can
store additional [[metadata]], like the author and itemdescription.
This can then be looked up later, used in [[metadata_driven_views]], etc.

To make all available metadata from the feed be stored:

`git config annex.genmetadata true`

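Once stored, the metadata can drive views; for example (the field name is illustrative):

`git annex view "feedtitle=*"`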