addurl: Avoid crashing when used on beegfs.

Sponsored-by: Dartmouth College's DANDI project
This commit is contained in:
Joey Hess 2021-07-05 13:02:40 -04:00
parent 17f8682d19
commit b9db859221
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
5 changed files with 67 additions and 2 deletions

View file

@ -2,6 +2,7 @@ git-annex (8.20210631) UNRELEASED; urgency=medium
* assistant: Avoid unncessary git repository repair in a situation where
git fsck gets confused about a commit that is made while it's running.
* addurl: Avoid crashing when used on beegfs.
-- Joey Hess <id@joeyh.name> Wed, 30 Jun 2021 17:55:10 -0400

View file

@ -315,7 +315,7 @@ downloadWeb addunlockedmatcher o url urlinfo file =
urlkey = addSizeUrlKey urlinfo $ Backend.URL.fromUrl url Nothing
downloader f p = Url.withUrlOptions $ downloadUrl urlkey p [url] f
go Nothing = return Nothing
go (Just tmp) = ifM (pure (not (rawOption o)) <&&> liftIO (isHtml <$> readFile (fromRawFilePath tmp)))
go (Just tmp) = ifM (pure (not (rawOption o)) <&&> liftIO (isHtmlFile (fromRawFilePath tmp)))
( tryyoutubedl tmp
, normalfinish tmp
)

View file

@ -1,6 +1,6 @@
{- html detection
-
- Copyright 2017 Joey Hess <id@joeyh.name>
- Copyright 2017-2021 Joey Hess <id@joeyh.name>
-
- License: BSD-2-clause
-}
@ -8,10 +8,12 @@
module Utility.HtmlDetect (
isHtml,
isHtmlBs,
isHtmlFile,
htmlPrefixLength,
) where
import Text.HTML.TagSoup
import System.IO
import Data.Char
import qualified Data.ByteString.Lazy as B
import qualified Data.ByteString.Lazy.Char8 as B8
@ -44,6 +46,15 @@ isHtmlBs :: B.ByteString -> Bool
-- looks for ascii strings.
isHtmlBs = isHtml . B8.unpack
-- | Check if the file is html.
--
-- It would be equivilant to use isHtml <$> readFile file,
-- but since that would not read all of the file, the handle
-- would remain open until it got garbage collected sometime later.
isHtmlFile :: FilePath -> IO Bool
isHtmlFile file = withFile file ReadMode $ \h ->
isHtmlBs <$> B.hGet h htmlPrefixLength
-- | How much of the beginning of a html document is needed to detect it.
-- (conservatively)
htmlPrefixLength :: Int

View file

@ -83,3 +83,6 @@ on days ending with `y` it seems to work quite nicely.
[[!meta author=yoh]]
[[!tag projects/dandi]]
> [[fixed|done]], I think, though have not installed beegfs to test.
> --[[Joey]]

View file

@ -0,0 +1,50 @@
[[!comment format=mdwn
username="joey"
subject="""comment 3"""
date="2021-07-05T16:18:39Z"
content="""
I've checked with strace, to see if the file was open while it was being
renamed. Not that there is anything generally wrong with renaming an open
file on a POSIX file system, but it would possibly be a problem on windows,
where some forms of opening a file locks it in place. And apparently
this filesystem is not trying to be very POSIX either.
413026 openat(AT_FDCWD, ".git/annex/tmp/URL-s3--file&c%%%tmp%foo", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = 17
413026 write(17, "hi\n", 3) = 3
413026 close(17) = 0
...
413026 openat(AT_FDCWD, ".git/annex/tmp/URL-s3--file&c%%%tmp%foo", O_RDONLY|O_NOCTTY|O_NONBLOCK) = 11
413026 read(11, "hi\n", 8192) = 3
...
413026 openat(AT_FDCWD, ".git/annex/tmp/URL-s3--file&c%%%tmp%foo", O_RDONLY|O_NOCTTY|O_NONBLOCK <unfinished ...>
413028 <... futex resumed>) = 0
413026 <... openat resumed>) = 16
...
413026 read(16, "hi\n", 32752) = 3
...
413026 close(16) = 0
...
413026 rename(".git/annex/tmp/URL-s3--file&c%%%tmp%foo", "_tmp_foo") = 0
...
413028 close(11) = 0
So the file is left open across the rename, which ought to be able to be
changed and would presumably fix the problem.
It's also a bit odd that the file gets read twice after being copied,
once for checksum makes sense, but what's the other one?
(Copying while checksumming should be able to avoid one of the reads,
but there is an open todo tracking progress on that.)
Aah, the other read is when it's probing if the file is html in case it ought
to be passed off to youtube-dl. That is the read that lingers for a while,
because it's done with a lazy readFile and probing if the file is html doesn't
read to the end and close it, so the file handle lingers until the GC gets
around to closing it. Of course youtube-dl won't be able to do anything with a
file url, but git-annex doesn't know that. And anyway the failure on this
filesystem would also happen when adding a http url.
Ok, fixed it to close the handle promptly. That should fix the test suite.
It does not seem unlikely that something else will break due to this
filesystem's unusual behavior though.
"""]]