avoid url resume from 0

When downloading a url and the destination file exists but is empty,
avoid using an http range to resume, since a range of "bytes=0-" is an
unusual edge case that is best not relied on to work.

This is known to fix a case where importfeed downloaded a partial feed from
such a server. Since importfeed uses withTmpFile, the destination always
exists empty, so it would particularly tickle such problem servers. Resuming
from 0 is otherwise possible, but unlikely.
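
As a quick illustration of why importfeed always hits this case, here is a minimal sketch, using System.IO.Temp's withSystemTempFile as a stand-in for git-annex's withTmpFile (this is not git-annex code): the temp file exists, empty, before the download starts, so a size check sees 0 and the old code chose the resume path.

import System.Directory (getFileSize)
import System.IO (hClose)
import System.IO.Temp (withSystemTempFile)

main :: IO ()
main = withSystemTempFile "feed" $ \tmpfile h -> do
	hClose h
	sz <- getFileSize tmpfile
	-- prints 0: the destination exists but is empty, which previously
	-- triggered a "Range: bytes=0-" resume request
	print sz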
Joey Hess 2019-06-20 12:26:17 -04:00
parent 06ea1c4228
commit 759fd9ea68
4 changed files with 36 additions and 2 deletions


@@ -6,6 +6,10 @@ git-annex (7.20190616) UNRELEASED; urgency=medium
   * Other commands also run their cleanup phase using a separate job pool
     than their perform phase, which may make some of them somewhat faster
     when running concurrently as well.
+  * When downloading a url and the destination file exists but is empty,
+    avoid using an http range to resume, since a range of "bytes=0-" is an
+    unusual edge case that is best not relied on to work. This is known to
+    fix a case where importfeed downloaded a partial feed from such a server.
  -- Joey Hess <id@joeyh.name>  Sat, 15 Jun 2019 12:38:25 -0400


@@ -375,13 +375,13 @@ download' noerror meterupdate url file uo =
 	ftpport = 21
 	downloadconduit req = catchMaybeIO (getFileSize file) >>= \case
-		Nothing -> runResourceT $ do
+		Just sz | sz > 0 -> resumeconduit req' sz
+		_ -> runResourceT $ do
 			liftIO $ debugM "url" (show req')
 			resp <- http req' (httpManager uo)
 			if responseStatus resp == ok200
 				then store zeroBytesProcessed WriteMode resp
 				else showrespfailure resp
-		Just sz -> resumeconduit req' sz
 	  where
 		req' = applyRequest uo $ req
 		-- Override http-client's default decompression of gzip
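
To make the decision in this hunk concrete, here is a small, self-contained sketch of the same idea using http-conduit directly. It is not git-annex's download'; the names rangeRequest and downloadTo are made up for illustration. The destination's size is checked first, a Range header is only sent when there is already data to append to, and an empty or missing file gets a plain GET.

import qualified Data.ByteString.Char8 as B8
import qualified Data.Conduit.Binary as CB
import Control.Monad.Trans.Resource (runResourceT)
import Data.Conduit (runConduit, (.|))
import Network.HTTP.Client (Request, newManager, parseRequest, requestHeaders, responseBody)
import Network.HTTP.Client.TLS (tlsManagerSettings)
import Network.HTTP.Conduit (http)
import Network.HTTP.Types.Header (hRange)
import System.Directory (doesFileExist, getFileSize)
import System.IO (IOMode(AppendMode), openFile)

-- Only ask the server to resume when some bytes are already on disk;
-- a missing or zero-length destination gets a plain GET instead of
-- the "Range: bytes=0-" edge case.
rangeRequest :: Integer -> Request -> Request
rangeRequest sz req
	| sz > 0 = req
		{ requestHeaders = (hRange, rangeval) : requestHeaders req }
	| otherwise = req
  where
	rangeval = B8.pack ("bytes=" ++ show sz ++ "-")

-- Hypothetical helper, not git-annex's download'.
downloadTo :: String -> FilePath -> IO ()
downloadTo url file = do
	mgr <- newManager tlsManagerSettings
	req <- parseRequest url
	exists <- doesFileExist file
	sz <- if exists then getFileSize file else return 0
	runResourceT $ do
		resp <- http (rangeRequest sz req) mgr
		-- append to the partial file when resuming, overwrite otherwise
		let sink = if sz > 0 then CB.sinkIOHandle (openFile file AppendMode) else CB.sinkFile file
		runConduit $ responseBody resp .| sink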


@@ -49,3 +49,5 @@ ok
 ### Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders)
 <3
+
+> [[fixed|done]] --[[Joey]]


@@ -0,0 +1,28 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2019-06-20T15:18:49Z"
content="""
Somehow git-annex receives a truncated file from the web server,
so it is unable to parse it.
That only happens when using the haskell http library to download.
When git-annex is configured to use curl, it works.
So, workaround:

    git -c annex.security.allowed-http-addresses=all -c annex.web-options=-4 annex importfeed \
        https://www.deutschlandfunk.de/podcast-deutschlandfunk-der-tag.3417.de.podcast.xml
git-annex addurl downloads the complete file, so the problem does not
seem to be the haskell http library itself, but something about how
importfeed uses it that causes the truncation.
Aha, importfeed uses withTmpFile, so the destination file exists with 0
size. This triggers a resume code path. And it looks to me like this web
server may not handle resume very well; it appears to send ~32kb
of data, rather than the whole file, in that case.
So, the obvious fix is to not resume when the destination file is empty,
and I've done that.
"""]]