add Utility.HtmlDetect

This will be used in the youtube-dl integration, to tell when an html
page has been downloaded by addurl, in which case it is worth running
youtube-dl to see if it can extract media from it.

tagsoup is an almost free dependency, because yesod depends on it.
So, this only really adds a dep when git-annex is built without the
webapp.

I'd like this to match, as closely as possible, how browsers decide
whether a page is html. Unfortunately, browsers decide that fairly
heuristically, in order to support malformed html. And we don't want to
falsely detect something as html just because it has something that
looks like an html tag embedded somewhere in it. Probably any major
video hosting site is going to be serving html documents that at least
start with a <html> tag, so requiring that or a DOCTYPE should be good
enough.
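
The "starts with a <html> tag or a DOCTYPE" rule can be sketched roughly
as follows. This is only an illustration in Python, not the code in
Utility.HtmlDetect (which is Haskell and uses tagsoup); the name
looks_like_html and the regular expression are assumptions made for this
example, and it does not handle a leading BOM or comments.

```python
import re

# Hypothetical helper for illustration only; not git-annex's implementation.
# Accept optional leading whitespace, then either a DOCTYPE declaration
# for html or an opening <html> tag (with or without attributes).
_HTML_START = re.compile(r"\s*(?:<!doctype\s+html\b|<html[\s>])", re.IGNORECASE)


def looks_like_html(text: str) -> bool:
    """Return True only if the document *starts* with a DOCTYPE or <html> tag.

    A tag that merely appears somewhere later in the content does not count,
    which avoids treating arbitrary files with embedded markup as html.
    """
    return _HTML_START.match(text) is not None
```

Anchoring the match at the start is what keeps a file that merely
contains "<html>" somewhere in the middle from being detected as html.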

This commit was sponsored by Jeff Goeke-Smith on Patreon.
Joey Hess 2017-11-28 12:50:30 -04:00
parent c37838d1b7
commit 57b4c5bdff
GPG key ID: DB12DB0FF05F8F38
4 changed files with 46 additions and 0 deletions

@@ -23,6 +23,14 @@ Both of those changes would need changes to user's workflows and cron jobs.
git-annex could keep supporting quvi for some time, and warn when it uses
quvi, to help with the transition.
> Alternatively, git-annex addurl could download the url first, and then
> check the file to see if it looks like html. If so, run youtube-dl (which
> unfortunately has to download it again) and see if it manages to rip
> media from it. This way, addurl of non-html files does not have extra
> overhead, and the redundant download is fairly small compared to ripping
> the media. Only the unusual case where addurl is being used on html that
> does not contain media becomes more expensive.
Another gotcha is playlists. youtube-dl downloads playlists automatically.
But, git-annex needs to record an url that downloads a single file so that
`git annex get` works right. So, playlists will need to be disabled when