avoid url resume from 0
When downloading a URL, if the destination file exists but is empty, avoid using an HTTP Range header to resume, since a range of "bytes=0-" is an unusual edge case that is best not relied upon to work. This is known to fix a case where importfeed downloaded a partial feed from such a server. Since importfeed uses withTmpFile, the destination always exists and is empty, so it would particularly tickle such problem servers. Resuming from 0 is otherwise possible, but unlikely.
parent 06ea1c4228
commit 759fd9ea68
4 changed files with 36 additions and 2 deletions
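For illustration, a minimal http-client sketch (not git-annex code; the URL is a hypothetical stand-in and the use of httpLbs is an assumption) of the degenerate request that resuming from offset 0 produces. The header renders as "Range: bytes=0-"; a well-behaved server treats that like a plain GET, but the problem server answered it with a truncated body:

    -- Sketch, not git-annex code: a "resume" from offset 0 sends the
    -- degenerate header "Range: bytes=0-", which some servers mishandle.
    import qualified Data.ByteString.Char8 as B8
    import qualified Data.ByteString.Lazy as L
    import Network.HTTP.Client
    import Network.HTTP.Client.TLS (newTlsManager)
    import Network.HTTP.Types.Header (hRange)

    main :: IO ()
    main = do
        mgr <- newTlsManager
        -- hypothetical URL standing in for the problem feed
        req <- parseRequest "https://example.com/feed.xml"
        let sz = 0 :: Integer -- size of the existing, empty destination file
        let req' = req { requestHeaders =
                (hRange, B8.pack ("bytes=" ++ show sz ++ "-"))
                    : requestHeaders req }
        resp <- httpLbs req' mgr
        -- a problem server may return far fewer bytes than the full file
        print (L.length (responseBody resp))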
@@ -6,6 +6,10 @@ git-annex (7.20190616) UNRELEASED; urgency=medium
   * Other commands also run their cleanup phase using a separate job pool
     than their perform phase, which may make some of them somewhat faster
     when running concurrently as well.
+  * When downloading an url and the destination file exists but is empty,
+    avoid using http range to resume, since a range "bytes=0-" is an unusual
+    edge case that it's best to avoid relying on working. This is known to
+    fix a case where importfeed downloaded a partial feed from such a server.
 
  -- Joey Hess <id@joeyh.name>  Sat, 15 Jun 2019 12:38:25 -0400
 
@@ -375,13 +375,13 @@ download' noerror meterupdate url file uo =
 	ftpport = 21
 
 	downloadconduit req = catchMaybeIO (getFileSize file) >>= \case
-		Nothing -> runResourceT $ do
+		Just sz | sz > 0 -> resumeconduit req' sz
+		_ -> runResourceT $ do
 			liftIO $ debugM "url" (show req')
 			resp <- http req' (httpManager uo)
 			if responseStatus resp == ok200
 				then store zeroBytesProcessed WriteMode resp
 				else showrespfailure resp
-		Just sz -> resumeconduit req' sz
 	  where
 		req' = applyRequest uo $ req
 		-- Override http-client's default decompression of gzip
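resumeconduit itself is not shown in this hunk. A plausible sketch of how such a resume header is built with the http-types package (the name resumeFromHeader is an assumption, not confirmed by this diff) makes the edge case concrete: ByteRangeFrom 0 renders as exactly "bytes=0-", which is why the new guard routes empty destination files through the plain, non-resuming path:

    -- Sketch, assuming http-types' renderByteRanges; not necessarily
    -- the exact helper used by git-annex.
    import Network.HTTP.Types
        (ByteRange (ByteRangeFrom), Header, hRange, renderByteRanges)

    resumeFromHeader :: Integer -> Header
    resumeFromHeader sz = (hRange, renderByteRanges [ByteRangeFrom sz])
    -- resumeFromHeader 0 == ("Range", "bytes=0-")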
@@ -49,3 +49,5 @@ ok
 ### Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders)
 
 <3
+
+> [[fixed|done]] --[[Joey]]
@@ -0,0 +1,28 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 1"""
+ date="2019-06-20T15:18:49Z"
+ content="""
+Somehow git-annex receives a truncated file from the web server,
+so it is unable to parse it.
+
+That only happens when using the haskell http library to download.
+When git-annex is configured to use curl, it works.
+
+So, workaround:
+
+	git -c annex.security.allowed-http-addresses=all -c annex.web-options=-4 annex importfeed \
+		https://www.deutschlandfunk.de/podcast-deutschlandfunk-der-tag.3417.de.podcast.xml
+
+git-annex addurl downloads the complete file, so the problem does not
+seem to be with the haskell http library, but something to do with how
+importfeed is using it that causes a truncation.
+
+Aha, importfeed uses withTmpFile, so the destination file exists with 0
+size. This triggers a resume code path. And it looks to me like this web
+server may not handle resume very well, it appears to send ~32kb
+of data and not the whole file in that case.
+
+So, the obvious fix is to not resume when the destination file is empty,
+and I've done that.
+"""]]