addurl: Avoid crashing when used on beegfs.
Sponsored-by: Dartmouth College's DANDI project
This commit is contained in:
parent
17f8682d19
commit
b9db859221
5 changed files with 67 additions and 2 deletions
|
@ -2,6 +2,7 @@ git-annex (8.20210631) UNRELEASED; urgency=medium
|
|||
|
||||
* assistant: Avoid unncessary git repository repair in a situation where
|
||||
git fsck gets confused about a commit that is made while it's running.
|
||||
* addurl: Avoid crashing when used on beegfs.
|
||||
|
||||
-- Joey Hess <id@joeyh.name> Wed, 30 Jun 2021 17:55:10 -0400
|
||||
|
||||
|
|
|
@ -315,7 +315,7 @@ downloadWeb addunlockedmatcher o url urlinfo file =
|
|||
urlkey = addSizeUrlKey urlinfo $ Backend.URL.fromUrl url Nothing
|
||||
downloader f p = Url.withUrlOptions $ downloadUrl urlkey p [url] f
|
||||
go Nothing = return Nothing
|
||||
go (Just tmp) = ifM (pure (not (rawOption o)) <&&> liftIO (isHtml <$> readFile (fromRawFilePath tmp)))
|
||||
go (Just tmp) = ifM (pure (not (rawOption o)) <&&> liftIO (isHtmlFile (fromRawFilePath tmp)))
|
||||
( tryyoutubedl tmp
|
||||
, normalfinish tmp
|
||||
)
|
||||
|
|
|
@ -1,6 +1,6 @@
|
|||
{- html detection
|
||||
-
|
||||
- Copyright 2017 Joey Hess <id@joeyh.name>
|
||||
- Copyright 2017-2021 Joey Hess <id@joeyh.name>
|
||||
-
|
||||
- License: BSD-2-clause
|
||||
-}
|
||||
|
@ -8,10 +8,12 @@
|
|||
module Utility.HtmlDetect (
|
||||
isHtml,
|
||||
isHtmlBs,
|
||||
isHtmlFile,
|
||||
htmlPrefixLength,
|
||||
) where
|
||||
|
||||
import Text.HTML.TagSoup
|
||||
import System.IO
|
||||
import Data.Char
|
||||
import qualified Data.ByteString.Lazy as B
|
||||
import qualified Data.ByteString.Lazy.Char8 as B8
|
||||
|
@ -44,6 +46,15 @@ isHtmlBs :: B.ByteString -> Bool
|
|||
-- looks for ascii strings.
|
||||
isHtmlBs = isHtml . B8.unpack
|
||||
|
||||
-- | Check if the file is html.
|
||||
--
|
||||
-- It would be equivilant to use isHtml <$> readFile file,
|
||||
-- but since that would not read all of the file, the handle
|
||||
-- would remain open until it got garbage collected sometime later.
|
||||
isHtmlFile :: FilePath -> IO Bool
|
||||
isHtmlFile file = withFile file ReadMode $ \h ->
|
||||
isHtmlBs <$> B.hGet h htmlPrefixLength
|
||||
|
||||
-- | How much of the beginning of a html document is needed to detect it.
|
||||
-- (conservatively)
|
||||
htmlPrefixLength :: Int
|
||||
|
|
|
@ -83,3 +83,6 @@ on days ending with `y` it seems to work quite nicely.
|
|||
|
||||
[[!meta author=yoh]]
|
||||
[[!tag projects/dandi]]
|
||||
|
||||
> [[fixed|done]], I think, though have not installed beegfs to test.
|
||||
> --[[Joey]]
|
||||
|
|
|
@ -0,0 +1,50 @@
|
|||
[[!comment format=mdwn
|
||||
username="joey"
|
||||
subject="""comment 3"""
|
||||
date="2021-07-05T16:18:39Z"
|
||||
content="""
|
||||
I've checked with strace, to see if the file was open while it was being
|
||||
renamed. Not that there is anything generally wrong with renaming an open
|
||||
file on a POSIX file system, but it would possibly be a problem on windows,
|
||||
where some forms of opening a file locks it in place. And apparently
|
||||
this filesystem is not trying to be very POSIX either.
|
||||
|
||||
413026 openat(AT_FDCWD, ".git/annex/tmp/URL-s3--file&c%%%tmp%foo", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = 17
|
||||
413026 write(17, "hi\n", 3) = 3
|
||||
413026 close(17) = 0
|
||||
...
|
||||
413026 openat(AT_FDCWD, ".git/annex/tmp/URL-s3--file&c%%%tmp%foo", O_RDONLY|O_NOCTTY|O_NONBLOCK) = 11
|
||||
413026 read(11, "hi\n", 8192) = 3
|
||||
...
|
||||
413026 openat(AT_FDCWD, ".git/annex/tmp/URL-s3--file&c%%%tmp%foo", O_RDONLY|O_NOCTTY|O_NONBLOCK <unfinished ...>
|
||||
413028 <... futex resumed>) = 0
|
||||
413026 <... openat resumed>) = 16
|
||||
...
|
||||
413026 read(16, "hi\n", 32752) = 3
|
||||
...
|
||||
413026 close(16) = 0
|
||||
...
|
||||
413026 rename(".git/annex/tmp/URL-s3--file&c%%%tmp%foo", "_tmp_foo") = 0
|
||||
...
|
||||
413028 close(11) = 0
|
||||
|
||||
So the file is left open across the rename, which ought to be able to be
|
||||
changed and would presumably fix the problem.
|
||||
|
||||
It's also a bit odd that the file gets read twice after being copied,
|
||||
once for checksum makes sense, but what's the other one?
|
||||
(Copying while checksumming should be able to avoid one of the reads,
|
||||
but there is an open todo tracking progress on that.)
|
||||
|
||||
Aah, the other read is when it's probing if the file is html in case it ought
|
||||
to be passed off to youtube-dl. That is the read that lingers for a while,
|
||||
because it's done with a lazy readFile and probing if the file is html doesn't
|
||||
read to the end and close it, so the file handle lingers until the GC gets
|
||||
around to closing it. Of course youtube-dl won't be able to do anything with a
|
||||
file url, but git-annex doesn't know that. And anyway the failure on this
|
||||
filesystem would also happen when adding a http url.
|
||||
|
||||
Ok, fixed it to close the handle promptly. That should fix the test suite.
|
||||
It does not seem unlikely that something else will break due to this
|
||||
filesystem's unusual behavior though.
|
||||
"""]]
|
Loading…
Add table
Reference in a new issue