git-annex/Utility/HtmlDetect.hs

{- html detection
 -
 - Copyright 2017-2021 Joey Hess <id@joeyh.name>
 -
 - License: BSD-2-clause
 -}

module Utility.HtmlDetect (
	isHtml,
	isHtmlBs,
	isHtmlFile,
	htmlPrefixLength,
) where

import Text.HTML.TagSoup
import System.IO
import Data.Char
import qualified Data.ByteString.Lazy as B
import qualified Data.ByteString.Lazy.Char8 as B8

-- | Detect if a String is a html document.
--
-- The document may not be valid, or may be truncated, and will
-- still be detected as html, as long as it starts with a
-- "<html>" or "<!DOCTYPE html>" tag.
--
-- Html fragments like "<p>this</p>" are not detected as being html,
-- although some browsers may choose to render them as html.
isHtml :: String -> Bool
isHtml = evaluate . canonicalizeTags . parseTags . take htmlPrefixLength
  where
	evaluate (TagOpen "!DOCTYPE" ((t, _):_):_) = map toLower t == "html"
	evaluate (TagOpen "html" _:_) = True
	-- Allow some leading whitespace before the tag.
	evaluate (TagText t:rest)
		| all isSpace t = evaluate rest
		| otherwise = False
	-- It would be pretty weird to have a html comment before the html
	-- tag, but easy to allow for.
	evaluate (TagComment _:rest) = evaluate rest
	evaluate _ = False

-- | Detect if a ByteString is a html document.
isHtmlBs :: B.ByteString -> Bool
-- The encoding of the ByteString is not known, but isHtml only
-- looks for ASCII strings.
isHtmlBs = isHtml . B8.unpack

-- | Check if the file is html.
--
-- It would be equivalent to use isHtml <$> readFile file,
-- but since that would not read all of the file, the handle
-- would remain open until it got garbage collected sometime later.
isHtmlFile :: FilePath -> IO Bool
isHtmlFile file = withFile file ReadMode $ \h ->
	isHtmlBs <$> B.hGet h htmlPrefixLength

-- | How much of the beginning of a html document is needed to detect it.
-- (conservatively)
htmlPrefixLength :: Int
htmlPrefixLength = 8192
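
The detection rule described in the comments above (leading whitespace and comments are skipped, then the first tag must be `<html>` or `<!DOCTYPE html>`) can be sketched without TagSoup. This is only a simplified approximation using `base` and `bytestring`, not the module's implementation: the names `looksLikeHtml`, `looksLikeHtmlFile`, and `prefixLength` are hypothetical, and the hand-written scanner may disagree with TagSoup on unusual or malformed input.

```haskell
import Data.Char (isSpace, toLower)
import Data.List (isPrefixOf)
import System.IO (IOMode (ReadMode), withBinaryFile)
import qualified Data.ByteString as S
import qualified Data.ByteString.Char8 as S8

-- Hypothetical stand-in for htmlPrefixLength.
prefixLength :: Int
prefixLength = 8192

-- | Approximate the isHtml rule: skip leading whitespace and comments,
-- then require the input to start with an html or DOCTYPE html tag.
looksLikeHtml :: String -> Bool
looksLikeHtml = go . take prefixLength
  where
    go s
      | "<!--" `isPrefixOf` s' = go (skipComment (drop 4 s'))
      | otherwise = isDoctypeHtml s' || isHtmlTag s'
      where
        s' = dropWhile isSpace s

    -- Drop everything up to and including the closing "-->".
    -- An unterminated comment consumes the rest of the input.
    skipComment s@(_ : rest)
      | "-->" `isPrefixOf` s = drop 3 s
      | otherwise = skipComment rest
    skipComment [] = []

    isHtmlTag s = maybe False tagEnds (stripPrefixCI "<html" s)

    isDoctypeHtml s = case stripPrefixCI "<!doctype" s of
      Just rest@(c : _)
        | isSpace c ->
            maybe False tagEnds (stripPrefixCI "html" (dropWhile isSpace rest))
      _ -> False

    -- The tag name must end here, so "<htmlx>" is rejected.
    -- Truncated input (end of string) is accepted.
    tagEnds [] = True
    tagEnds (c : _) = isSpace c || c `elem` ">/"

-- | Case-insensitive stripPrefix; the prefix must be given in lowercase.
stripPrefixCI :: String -> String -> Maybe String
stripPrefixCI p s
  | map toLower (take (length p) s) == p = Just (drop (length p) s)
  | otherwise = Nothing

-- | Read at most prefixLength bytes strictly, so the handle can be
-- closed by withBinaryFile before the result is examined.
looksLikeHtmlFile :: FilePath -> IO Bool
looksLikeHtmlFile file = withBinaryFile file ReadMode $ \h ->
  looksLikeHtml . S8.unpack <$> S.hGet h prefixLength
```

With these definitions, `looksLikeHtml "<html><body>"` and `looksLikeHtml "  <!-- note -->\n<!DOCTYPE html>"` are accepted, while the fragment `looksLikeHtml "<p>this</p>"` is rejected, matching the behaviour documented for `isHtml`. The strict `hGet` in `looksLikeHtmlFile` follows the same reasoning as `isHtmlFile`: only a bounded prefix is read, and the handle does not linger until garbage collection.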

add Utility.HtmlDetect (2017-11-28 16:50:30 +00:00)

This will be used in youtube-dl integration, to tell when a html page has been downloaded by addurl, in which case it is worth running youtube-dl to see if it can extract media from it.

tagsoup is an almost free dependency, because yesod depends on it. So, this only really adds a dep when git-annex is built without the webapp.

I'd like this to as closely as possible match how browsers decide if a page is html or not. Unfortunately, that is fairly heuristic, in order to support malformed html. And, we don't want to falsely detect something as html just because it has something that looks like a html tag embedded somewhere in it. Probably any major video hosting site is going to be serving html documents that at least start with a <html> tag, so requiring that or a DOCTYPE should be good enough.

This commit was sponsored by Jeff Goeke-Smith on Patreon.

fix regression in addurl --file caused by youtube-dl support (2017-12-06 17:16:06 +00:00)

Now youtubeDlCheck downloads the beginning of the url's content and checks if it's html, only when it is does it pass it off the youtube-dl to check if it supports it.

This means more work is done for urls that youtube-dl does support, but is probably more efficient for other urls, since it only downloads the first chunk of content, while youtube-dl probably downloads more.

As well as the reported bug, this also fixes behavior when an url was added with youtube-dl, but the url content has now changed from a html page to something else. Remote.Web.checkKey used to wrongly succeed in that situation, since youtube-dl said sure it can download that something else.

This commit was supported by the NSF-funded DataLad project.

explict export lists (2019-11-21 19:38:06 +00:00)

Eliminated some dead code. In other cases, exported a currently unused function, since it was a logical part of the API.

Of course this improves the API documentation. It may also sometimes let ghc optimize code better, since it can know a function is internal to a module.

364 modules still to go, according to git grep -E 'module [A-Za-z.]+ where'

addurl: Avoid crashing when used on beegfs. (2021-07-05 17:02:40 +00:00)

Sponsored-by: Dartmouth College's DANDI project

Apply codespell -w throughout (2023-03-14 02:39:16 +00:00)