avoid url resume from 0
When downloading a url and the destination file exists but is empty, avoid using an http range to resume, since a range of "bytes=0-" is an unusual edge case that is best not relied on to work. This is known to fix a case where importfeed downloaded a partial feed from such a server.

Since importfeed uses withTmpFile, the destination always exists empty, so it would particularly tickle such problem servers. Resuming from 0 is otherwise possible, but unlikely.
parent 06ea1c4228
commit 759fd9ea68

4 changed files with 36 additions and 2 deletions
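For background, an HTTP resume is expressed as a Range request header: resuming from offset N sends "Range: bytes=N-", so resuming into an empty file sends "bytes=0-", which semantically asks for the whole file yet still invites a 206 partial response. A server with correct range handling returns the complete body either way; the server in the bug report below apparently truncates instead. A minimal sketch of building such a request with http-client (resumeFrom is an illustrative helper, not git-annex's own code):

{-# LANGUAGE OverloadedStrings #-}

import Network.HTTP.Client (Request, parseRequest, requestHeaders)
import qualified Data.ByteString.Char8 as B8

-- Attach a Range header resuming from the given byte offset.
-- With offset 0 this yields "Range: bytes=0-": it asks for the
-- whole file, but some servers mishandle it and send a truncated
-- body, which is the edge case this commit sidesteps.
resumeFrom :: Integer -> Request -> Request
resumeFrom offset req = req
    { requestHeaders =
        ("Range", B8.pack ("bytes=" ++ show offset ++ "-"))
            : requestHeaders req
    }

main :: IO ()
main = do
    req <- parseRequest "http://example.com/file"
    print (requestHeaders (resumeFrom 0 req))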
@@ -6,6 +6,10 @@ git-annex (7.20190616) UNRELEASED; urgency=medium
   * Other commands also run their cleanup phase using a separate job pool
     than their perform phase, which may make some of them somewhat faster
     when running concurrently as well.
+  * When downloading an url and the destination file exists but is empty,
+    avoid using http range to resume, since a range "bytes=0-" is an unusual
+    edge case that it's best to avoid relying on working. This is known to
+    fix a case where importfeed downloaded a partial feed from such a server.
 
  -- Joey Hess <id@joeyh.name>  Sat, 15 Jun 2019 12:38:25 -0400
 
@@ -375,13 +375,13 @@ download' noerror meterupdate url file uo =
 	ftpport = 21
 
 	downloadconduit req = catchMaybeIO (getFileSize file) >>= \case
-		Nothing -> runResourceT $ do
+		Just sz | sz > 0 -> resumeconduit req' sz
+		_ -> runResourceT $ do
 			liftIO $ debugM "url" (show req')
 			resp <- http req' (httpManager uo)
 			if responseStatus resp == ok200
 				then store zeroBytesProcessed WriteMode resp
 				else showrespfailure resp
-		Just sz -> resumeconduit req' sz
 	  where
 		req' = applyRequest uo $ req
 		-- Override http-client's default decompression of gzip
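Taken in isolation, the new case split resumes only when the size probe reports a positive size; both a missing file and an empty file now fall through to a fresh full download. A self-contained sketch of that guard, with simplified stand-ins for git-annex's catchMaybeIO and getFileSize helpers:

{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE ScopedTypeVariables #-}

import Control.Exception (SomeException, try)
import System.IO (IOMode(ReadMode), hFileSize, withFile)

-- Stand-in for git-annex's catchMaybeIO: any IO exception
-- (e.g. file does not exist) becomes Nothing.
catchMaybeIO :: IO a -> IO (Maybe a)
catchMaybeIO a = either (\(_ :: SomeException) -> Nothing) Just <$> try a

-- Stand-in for git-annex's getFileSize.
getFileSize :: FilePath -> IO Integer
getFileSize f = withFile f ReadMode hFileSize

-- The guard from the patch: Just a positive offset means resume;
-- a missing or zero-size destination means a full download.
resumeOffset :: FilePath -> IO (Maybe Integer)
resumeOffset file = catchMaybeIO (getFileSize file) >>= \case
    Just sz | sz > 0 -> pure (Just sz)  -- resume: "Range: bytes=<sz>-"
    _ -> pure Nothing                   -- start from byte 0, no Range header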
@@ -49,3 +49,5 @@ ok
 ### Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders)
 
 <3
+
+> [[fixed|done]] --[[Joey]]
@@ -0,0 +1,28 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 1"""
+ date="2019-06-20T15:18:49Z"
+ content="""
+Somehow git-annex receives a truncated file from the web server,
+so it is unable to parse it.
+
+That only happens when using the haskell http library to download.
+When git-annex is configured to use curl, it works.
+
+So, workaround:
+
+	git -c annex.security.allowed-http-addresses=all -c annex.web-options=-4 annex importfeed \
+	https://www.deutschlandfunk.de/podcast-deutschlandfunk-der-tag.3417.de.podcast.xml
+
+git-annex addurl downloads the complete file, so the problem does not
+seem to be with the haskell http library, but something to do with how
+importfeed is using it that causes a truncation.
+
+Aha, importfeed uses withTmpFile, so the destination file exists with 0
+size. This triggers a resume code path. And it looks to me like this web
+server may not handle resume very well, it appears to send ~32kb
+of data and not the whole file in that case.
+
+So, the obvious fix is to not resume when the destination file is empty,
+and I've done that.
+"""]]
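The triggering state described in that comment is easy to reproduce: a freshly created temp file exists with size 0, so the size probe returns Just 0 and, before this fix, the download took the resume path. A small sketch using the temporary package's withSystemTempFile as a stand-in for git-annex's withTmpFile:

import System.Directory (getFileSize)
import System.IO (hClose)
import System.IO.Temp (withSystemTempFile)

main :: IO ()
main = withSystemTempFile "feed.xml" $ \path h -> do
    hClose h                 -- the file exists, but nothing has been written
    sz <- getFileSize path
    print sz                 -- prints 0: the state importfeed downloads into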