From ff682b6029429e4d679225b2e9fdb69e82447a8b Mon Sep 17 00:00:00 2001 From: ewen Date: Sun, 15 Jul 2018 02:17:05 +0000 Subject: [PATCH 1/4] Regression in podcast feed download for one RSS feed --- ...__episodes__63__format__61__RSS__34__.mdwn | 228 ++++++++++++++++++ 1 file changed, 228 insertions(+) create mode 100644 doc/bugs/importfeed_bad_request_for___34__episodes__63__format__61__RSS__34__.mdwn diff --git a/doc/bugs/importfeed_bad_request_for___34__episodes__63__format__61__RSS__34__.mdwn b/doc/bugs/importfeed_bad_request_for___34__episodes__63__format__61__RSS__34__.mdwn new file mode 100644 index 0000000000..27933ceefd --- /dev/null +++ b/doc/bugs/importfeed_bad_request_for___34__episodes__63__format__61__RSS__34__.mdwn @@ -0,0 +1,228 @@ +### Please describe the problem. + +Since upgrading to `6.20180626-gdf91a5cff` (pre-built OS X binary on OS X 10.11) it appears `git annex importfeed` is unable to properly download a RSS feed from one particular site -- it gets back `400 Bad Request` from the website, and then unsurprisingly reports `warning: bad feed content; no enclosures to download`. The relevant RSS feed appears sane when downloaded from the same URL with `wget` or `curl`. The previous `git-annex` version I was using on this machine (before the security update), `6.20170320-g41c5d9d` was able to download this RSS feed (I realise that's a rather large window of changes!). + +It appears the older version (`6.20170320-g41c5d9d`) used `wget` to download the podcast RSS, and the newer version (`6.20180626-gdf91a5cff`) seems to fetch the RSS internally in a way that does not work with this RSS feed (nor does the newer version appear to provide debug output of what it is doing at that step that goes wrong, unlike the previous version which shows the explicit `wget` command run, even if multiple `--debug` and/or `--verbose` options are provided). + +All other podcast RSS feeds work on both versions. + +As best I can tell from packet captures and `telnet` debugging, the issue is that the new `git annex importfeed` does not send a `User-Agent:` header. Possibly some anti-DoS/anti-spam front end is rejecting it as a result? + +Can a suitable `User-Agent:` header be added to the `git annex importfeed` HTTP requests? + +(*Note* the bug title probably needs updating since I seem to have discovered the root cause after starting creating the bug.) + +### What steps will reproduce the problem? + +Attempt to `git annex importfeed` the feed `'https://theythempodcast.com/episodes?format=RSS'`. The problem also appears reproducible with `'http://theythempodcast.com/episodes?format=RSS'` (ie, unencrypted), which makes debugging a wee bit simpler (ie, packet captures can help). + +Failing, on `6.20180626-gdf91a5cff`: + + ewen@ashram:~/Music/podcasts$ git annex --debug importfeed --template="${TEMPLATE}" 'https://theythempodcast.com/episodes?format=RSS' 2>&1 + importfeed checking known urls [2018-07-15 13:36:53.298188] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","show-ref","git-annex"] + [2018-07-15 13:36:53.306676] process done ExitSuccess + [2018-07-15 13:36:53.306768] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","show-ref","--hash","refs/heads/git-annex"] + [2018-07-15 13:36:53.314672] process done ExitSuccess + [2018-07-15 13:36:53.315353] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","ls-files","--stage","-z","--","."] + [2018-07-15 13:36:53.325249] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch"] + [2018-07-15 13:36:53.32666] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch-check=%(objectname) %(objecttype) %(objectsize)"] + [2018-07-15 13:36:53.768979] process done ExitSuccess + ok + importfeed https://theythempodcast.com/episodes?format=RSS download failed: + + 400 Bad Request + + + +

400 Bad Request

+

e64OsAKG/PqPs52sZ @ Sun, 15 Jul 2018 01:36:57 GMT
+

SEC-43
+


+    
+    
+
+      warning: bad feed content; no enclosures to download
+    ok
+    [2018-07-15 13:36:58.076606] process done ExitSuccess
+    [2018-07-15 13:36:58.077183] process done ExitSuccess
+    ewen@ashram:~/Music/podcasts$ 
+
+Working, on `6.20170320-g41c5d9d`:
+
+
+ewen@ashram:~/Music/podcasts$ /Applications/OpenSource/git-annex-2017-03-20.app/Contents/MacOS/git-annex importfeed --template="${TEMPLATE}" 'https://theythempodcast.com/episodes?format=RSS'
+importfeed checking known urls ok
+importfeed https://theythempodcast.com/episodes?format=RSS 
+/var/folders/p1/gpj 100%[===================>]  32.62K  61.4KB/s    in 0.5s    
+2018-07-15 13:31:17 URL:https://theythempodcast.com/episodes?format=RSS [33404/33404] -> "/var/folders/p1/gpjb1g7s00zgp64p87w19pgm0000gn/T/feed16807282475249" [1]
+ok
+ewen@ashram:~/Music/podcasts$ /Applications/OpenSource/git-annex-2017-03-20.app/Contents/MacOS/git-annex --debug importfeed --template="${TEMPLATE}" 'https://theythempodcast.com/episodes?format=RSS' 2>&1 | grep -A 1 "format=RSS"
+importfeed https://theythempodcast.com/episodes?format=RSS 
+[2018-07-15 13:33:05.367548] call: wget ["-nv","--show-progress","--clobber","-c","-O","/var/folders/p1/gpjb1g7s00zgp64p87w19pgm0000gn/T/feed16807282475249","https://theythempodcast.com/episodes?format=RSS","--user-agent","git-annex/6.20170320-g41c5d9d"]
+
+     0K .......... .......... .......... ..                   100%  121K=0.3s2018-07-15 13:33:07 URL:https://theythempodcast.com/episodes?format=RSS [33404/33404] -> "/var/folders/p1/gpjb1g7s00zgp64p87w19pgm0000gn/T/feed16807282475249" [1]
+[2018-07-15 13:33:07.050443] process done ExitSuccess
+ewen@ashram:~/Music/podcasts$ 
+
+ +In both cases the `TEMPLATE` value is the same one I've used for years, and works with all the other podcast feeds: + + ewen@ashram:~/Music/podcasts$ set | grep TEMPLATE + TEMPLATE='archive/${feedtitle}/${itemtitle}${extension}' + ewen@ashram:~/Music/podcasts$ + +### What version of git-annex are you using? On what operating system? + +OS X 10.11 (El Capitan), using the pre-built OS X binaries downloaded just after the security release came out. + +
+ewen@ashram:~/Music/podcasts$ git annex version
+git-annex version: 6.20180626-gdf91a5cff
+build flags: Assistant Webapp Pairing S3(multipartupload)(storageclasses) WebDAV FsEvents ConcurrentOutput TorrentParser MagicMime Feeds Testsuite
+dependency versions: aws-0.17.1 bloomfilter-2.0.1.0 cryptonite-0.23 DAV-1.3.1 feed-0.3.12.0 ghc-8.0.2 http-client-0.5.7.0 persistent-sqlite-2.6.2 torrent-10000.1.1 uuid-1.3.13 yesod-1.4.5
+key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL
+remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar hook external
+operating system: darwin x86_64
+supported repository versions: 3 5 6
+upgrade supported from repository versions: 0 1 2 3 4 5
+local repository version: 5
+ewen@ashram:~/Music/podcasts$ 
+
+ + + +### Please provide any additional information below. + +Working request (created with `wget`, then replicated with `telnet`): + +
+ewen@ashram:~$ telnet 198.185.159.144 80
+Trying 198.185.159.144...
+Connected to 198.185.159.144.
+Escape character is '^]'.
+GET /episodes?format=RSS HTTP/1.1
+User-Agent: Wget/1.19.5 (darwin15.6.0)
+Accept: */*
+Accept-Encoding: identity
+Host: theythempodcast.com
+Connection: Keep-Alive
+Range: bytes=0-
+
+HTTP/1.1 301 Moved Permanently
+Date: Sun, 15 Jul 2018 02:02:53 GMT
+X-ServedBy: web012
+Location: https://theythempodcast.com/episodes?format=RSS
+Transfer-Encoding: chunked
+x-contextid: ZBu72tcg/OJSeq1q9
+x-via: 1.1 echo017
+
+0
+
+^]
+telnet> quit
+Connection closed.
+ewen@ashram:~$ 
+
+ +Failing request (as sent by `git-annex importfeed`): + + ewen@ashram:~$ telnet 198.185.159.144 80 + Trying 198.185.159.144... + Connected to 198.185.159.144. + Escape character is '^]'. + GET /episodes?format=RSS HTTP/1.1 + Host: theytheypodcast.com + Range: bytes=0- + Accept-Encoding: identity + + HTTP/1.1 400 Bad Request + content-length: 378 + x-synthetic: true + expires: Thu, 01 Jan 1970 00:00:00 UTC + pragma: no-cache + cache-control: no-cache, must-revalidate + content-type: text/html; charset=UTF-8 + connection: close + date: Sun, 15 Jul 2018 02:01:14 UTC + x-contextid: 1WIJdGgE/Y1Tq8lJu + x-via: 1.1 echo008 + + + + 400 Bad Request + + + +

400 Bad Request

+

1WIJdGgE/Y1Tq8lJu @ Sun, 15 Jul 2018 02:01:14 GMT
+

SEC-43
+


+    
+    Connection closed by foreign host.
+    ewen@ashram:~$ 
+
+Also failing request, with minimal HTTP/1.1 headers:
+
+    ewen@ashram:~$ telnet 198.185.159.144 80
+    Trying 198.185.159.144...
+    Connected to 198.185.159.144.
+    Escape character is '^]'.
+    GET /episodes?format=RSS HTTP/1.1
+    Host: theythempodcast.com
+
+    HTTP/1.1 400 Bad Request
+    content-length: 378
+    x-synthetic: true
+    expires: Thu, 01 Jan 1970 00:00:00 UTC
+    pragma: no-cache
+    cache-control: no-cache, must-revalidate
+    content-type: text/html; charset=UTF-8
+    connection: close
+    date: Sun, 15 Jul 2018 02:07:14 UTC
+    x-contextid: c05RnFlG/h78z0OqB
+    x-via: 1.1 echo034
+
+    
+    
+    400 Bad Request
+    
+    
+    
+    

400 Bad Request

+

c05RnFlG/h78z0OqB @ Sun, 15 Jul 2018 02:07:14 GMT
+

SEC-43
+


+    
+    Connection closed by foreign host.
+    ewen@ashram:~$ 
+
+Apparently minimal working request:
+
+    ewen@ashram:~$ telnet 198.185.159.144 80
+    Trying 198.185.159.144...
+    Connected to 198.185.159.144.
+    Escape character is '^]'.
+    GET /episodes?format=RSS HTTP/1.1
+    User-Agent: Wget/1.19.5 (darwin15.6.0)
+    Host: theythempodcast.com
+
+    HTTP/1.1 301 Moved Permanently
+    Date: Sun, 15 Jul 2018 02:09:09 GMT
+    X-ServedBy: web025
+    Location: https://theythempodcast.com/episodes?format=RSS
+    Transfer-Encoding: chunked
+    x-contextid: veJm4JC0/sT3ENh27
+    x-via: 1.1 echo017
+
+    0
+
+    ^]
+    telnet> quit
+    Connection closed.
+    ewen@ashram:~$ 
+
+Note that the only addition was adding a `User-Agent:` header (copied from the `wget` request).  The redirect to HTTPS is expected here, as we are initially requesting via HTTP for `telnet`s benefit (I did try to get `openssl s_client` working, but it appeared the TLS connection timed out before I could manually complete a request, and I had no luck piping the request into `openssl s_client`).
+
+### Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders)
+
+Absolutely :-)  `git annex` has been my podcatcher, and media management tool for years.  Thanks for writing it!

From eb48ee9bfe2cefbf8e3b8841e30b0ceb9cd072c6 Mon Sep 17 00:00:00 2001
From: ewen 
Date: Sun, 15 Jul 2018 02:17:52 +0000
Subject: [PATCH 2/4] rename
 bugs/importfeed_bad_request_for___34__episodes__63__format__61__RSS__34__.mdwn
 to bugs/importfeed_bad_request_without_User-Agent__58__.mdwn

---
 ....mdwn => importfeed_bad_request_without_User-Agent__58__.mdwn} | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename doc/bugs/{importfeed_bad_request_for___34__episodes__63__format__61__RSS__34__.mdwn => importfeed_bad_request_without_User-Agent__58__.mdwn} (100%)

diff --git a/doc/bugs/importfeed_bad_request_for___34__episodes__63__format__61__RSS__34__.mdwn b/doc/bugs/importfeed_bad_request_without_User-Agent__58__.mdwn
similarity index 100%
rename from doc/bugs/importfeed_bad_request_for___34__episodes__63__format__61__RSS__34__.mdwn
rename to doc/bugs/importfeed_bad_request_without_User-Agent__58__.mdwn

From f6cc088eecd55b57687138221b04579e079e81e7 Mon Sep 17 00:00:00 2001
From: ewen 
Date: Sun, 15 Jul 2018 02:18:46 +0000
Subject: [PATCH 3/4] Renamed bug to indicate missing User-Agent header seems
 to be cause.

---
 doc/bugs/importfeed_bad_request_without_User-Agent__58__.mdwn | 2 --
 1 file changed, 2 deletions(-)

diff --git a/doc/bugs/importfeed_bad_request_without_User-Agent__58__.mdwn b/doc/bugs/importfeed_bad_request_without_User-Agent__58__.mdwn
index 27933ceefd..f3aaaffc36 100644
--- a/doc/bugs/importfeed_bad_request_without_User-Agent__58__.mdwn
+++ b/doc/bugs/importfeed_bad_request_without_User-Agent__58__.mdwn
@@ -10,8 +10,6 @@ As best I can tell from packet captures and `telnet` debugging, the issue is tha
 
 Can a suitable `User-Agent:` header be added to the `git annex importfeed` HTTP requests?
 
-(*Note* the bug title probably needs updating since I seem to have discovered the root cause after starting creating the bug.)
-
 ### What steps will reproduce the problem?
 
 Attempt to `git annex importfeed` the feed `'https://theythempodcast.com/episodes?format=RSS'`.  The problem also appears reproducible with `'http://theythempodcast.com/episodes?format=RSS'` (ie, unencrypted), which makes debugging a wee bit simpler (ie, packet captures can help).

From 2b8f8838b5fb4483a50d5c9870c3c872ffb2f3f0 Mon Sep 17 00:00:00 2001
From: ewen 
Date: Sun, 15 Jul 2018 02:21:48 +0000
Subject: [PATCH 4/4] Added a comment: User-Agent

---
 ...comment_1_4412b2d9f4ed59ff078c8efd9ae6e13f._comment | 10 ++++++++++
 1 file changed, 10 insertions(+)
 create mode 100644 doc/bugs/importfeed_bad_request_without_User-Agent__58__/comment_1_4412b2d9f4ed59ff078c8efd9ae6e13f._comment

diff --git a/doc/bugs/importfeed_bad_request_without_User-Agent__58__/comment_1_4412b2d9f4ed59ff078c8efd9ae6e13f._comment b/doc/bugs/importfeed_bad_request_without_User-Agent__58__/comment_1_4412b2d9f4ed59ff078c8efd9ae6e13f._comment
new file mode 100644
index 0000000000..322506b8a4
--- /dev/null
+++ b/doc/bugs/importfeed_bad_request_without_User-Agent__58__/comment_1_4412b2d9f4ed59ff078c8efd9ae6e13f._comment
@@ -0,0 +1,10 @@
+[[!comment format=mdwn
+ username="ewen"
+ avatar="http://cdn.libravatar.org/avatar/605b2981cb52b4af268455dee7a4f64e"
+ subject="User-Agent"
+ date="2018-07-15T02:21:47Z"
+ content="""
+In theory `User-Agent:` is \"optional\", but [clients `SHOULD` send it](https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.43).  Looks like some servers are particularly fussy about it being sent... and off the top of my head I can't think of a reason *not* to send a `User-Agent:` header of some sort, as it is *very* widely sent.
+
+Ewen
+"""]]