Except when configuration makes curl be used. It did not seem worth
trying to tail the file when curl is downloading.
But when an interrupted download is resumed, it does not read the whole
existing file to hash it. Same reason discussed in
commit 7eb3742e4b76d1d7a487c2c53bf25cda4ee5df43; that could take a long
time with no progress being displayed. And also there's an open http
request, which needs to be consumed; taking a long time to hash the file
might cause it to time out.
Also in passing implemented it for git and external special remotes when
downloading from the web. Several others like S3 are within striking
distance now as well.
Sponsored-by: Dartmouth College's DANDI project
This uses a DebugSelector, rather than debug levels, which will allow
for a later option like --debug-from=Process to only
see debuging about running processes.
The module name that contains the thing being debugged is used as the
DebugSelector (in most cases; does not need to be a hard and fast rule).
Debug calls were changed to add that. hslogger did not display
that first parameter to debugM, but the DebugSelector does get
displayed.
Also fastDebug will allow doing debugging in places that are used in
tight loops, with the DebugSelector coming from the Annex Reader
essentially for free. Not done yet.
New error message:
Remote foo not usable by git-annex; setting annex-ignore
http://localhost/foo/config download failed: Configuration of annex.security.allowed-ip-addresses does not allow accessing address ::1
If git config parse fails, or the git config file is not available at the url,
a better error message for that is also shown.
This commit was sponsored by Mark Reidenbach on Patreon.
It had lost the hAcceptEncoding header that is set as part of the
overriding of http-client's default decompression of compressed files.
Seems likely that would have caused resuming of compressed files to fail
in some cases.
This commit was sponsored by Brett Eisenberg on Patreon.
Fix a bug that could make resuming a download from the web fail when the
entire content of the file is actually already present locally.
What a mess that Request can throw exceptions or not, depending on how
it's configured. Makes it very hard if you need to handle some specific
http status codes in a function like this! Implementing everything two
ways did not seem appealing, if possible at all, so I decided to
override the Request if it did come configured to throw exception on
non-2xx http status. Other exceptions, like from http-client-restricted,
or due to a redirect to a non-http url, still get thrown.
This commit was sponsored by Luke Shumaker on Patreon.
Lots of nice wins from this in avoiding unncessary work, and I think
nothing got slower.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
Potentially fixes https://git-annex.branchable.com/bugs/concurrent_git-annex-copy_to_s3_special_remote_fails/
although I don't know if it does.
My thinking is, ResourceT may allocate a resource and then free it,
and a unforced thunk to that resource could result in reading memory
that has since been overwritten by something else, or in a SEGV,
depending. While that seems kind of like a bug in ResourceT to me, if it
is what's happening, this will avoid it. If it's not, this doesn't
really hurt much since the values are all smallish.
This commit was sponsored by Graham Spencer on Patreon.
Otherwise use the vendored copy as before.
The library is in Debian testing but not stable. Once it reaches
stable, the vendored copy can be removed.
Did not add it to debian/control because IIRC that's used to build
git-annex on stable too, possibly. However, the Debian maintainer will
probably want to make the package depend on libghc-http-client-restricted-dev
This commit was sponsored by Ilya Shlyakhter on Patreon.
addurl: When run with --fast on an url that
annex.security.allowed-ip-addresses prevents accessing, display a more
useful message.
(Also importfeed --fast potentially.)
If git-credential has it cached and does not prompt, this will
unfortunately result in a brief flicker, as the displayed console
regions are hidden while running it and then re-displayed. Better than a
corrupted display.
Actually, I tried it and don't see a visible flicker, so probably only
over a slow ssh will it be apparent.
using git credential to get the password
One thing this doesn't do is wrap the password prompting inside the prompt
action. So with -J, the output can be a bit garbled.
Eliminated some dead code. In other cases, exported a currently unused
function, since it was a logical part of the API.
Of course this improves the API documentation. It may also sometimes
let ghc optimize code better, since it can know a function is internal
to a module.
364 modules still to go, according to
git grep -E 'module [A-Za-z.]+ where'
Convert Utility.Url to return Either String so the error message can be
displated in the annex monad and so captured.
(When curl is used, its errors are still not caught.)
I made some improvements to its API after splitting it out of git-annex,
so merge those back in.
This is groundwork for removing the embedded copy of it and depending on
it.
Also moved the managerResponseTimeout disabling to Annex.Url as it's
git-annex specific.
This commit was sponsored by Ethan Aubin on Patreon.
Drop support for building with ghc older than 8.4.4, and with older
versions of serveral haskell libraries than will be included in Debian 10.
The only remaining version ifdefs in the entire code base are now a couple
for aws!
This commit should only be merged after the Debian 10 release.
And perhaps it will need to wait longer than that; it would make
backporting new versions of git-annex to Debian 9 (stretch) which
has been actively happening as recently as this year.
This commit was sponsored by Ilya Shlyakhter.
When downloading an url and the destination file exists but is empty,
avoid using http range to resume, since a range "bytes=0-" is an unusual
edge case that it's best to avoid relying on working.
This is known to fix a case where importfeed downloaded a partial feed from
such a server. Since importfeed uses withTmpFile, the destination always exists
empty, so it would particularly tickle such problem servers. Resuming from 0
is otherwise possible, but unlikely.
Add back support for ftp urls, which was disabled as part of the fix for
security hole CVE-2018-10857 (except for configurations which enabled curl
and bypassed public IP address restrictions). Now it will work if allowed
by annex.security.allowed-ip-addresses.
Don't much like that there's no way to distinguish between having the whole
content and having an old version of the file that's bigger, but of course
resuming a http transfer can always yield the wrong result if the file on
the http server is changing, and git-annex will detect that when it
verifies the downloaded content.
This work is supported by the NIH-funded NICEMAN (ReproNim TR&D3) project.
The error message displayed used to only come from curl/wget and perhaps
was clearer than the one displayed now that http-client is used. In any
case, it does make sense to hide it because git-annex prints its own
warning message.
This commit was sponsored by Jake Vosloo on Patreon.
download is documented as displaying an error when download fails, but
it didn't when the url was not valid at all. That leads to confusing
behavior.
Also, display the url with --debug
When git-annex used wget and curl, --debug would show urls. So there can't
be any new security problem with doing so.
This commit was sponsored by John Pellman on Patreon.
Switch to using http-client for large file downloads caused the reversion;
the code for displaying a 404 response was instead displaying the raw html
document, which is not useful.
This commit was sponsored by Ryan Newton on Patreon.
Send User-Agent and any configured annex.http-headers when downloading with
http, fixes reversion introduced when switching to http-client.
This commit was sponsored by mo on Patreon.
They're no worse than http certianly. And, the backport of these
security fixes has to deal with wget, which supports http https and ftp
and has no way to turn off individual schemes, so this will make that
easier.
Security fix!
* git-annex will refuse to download content from http servers on
localhost, or any private IP addresses, to prevent accidental
exposure of internal data. This can be overridden with the
annex.security.allowed-http-addresses setting.
* Since curl's interface does not have a way to prevent it from accessing
localhost or private IP addresses, curl defaults to not being used
for url downloads, even if annex.web-options enabled it before.
Only when annex.security.allowed-http-addresses=all will curl be used.
Since S3 and WebDav use the Manager, the same policies apply to them too.
youtube-dl is not handled yet, and a http proxy configuration can bypass
these checks too. Those cases are still TBD.
This commit was sponsored by Jeff Goeke-Smith on Patreon.
Security fix! Allowing any schemes, particularly file: and
possibly others like scp: allowed file exfiltration by anyone who had
write access to the git repository, since they could add an annexed file
using such an url, or using an url that redirected to such an url,
and wait for the victim to get it into their repository and send them a copy.
* Added annex.security.allowed-url-schemes setting, which defaults
to only allowing http and https URLs. Note especially that file:/
is no longer enabled by default.
* Removed annex.web-download-command, since its interface does not allow
supporting annex.security.allowed-url-schemes across redirects.
If you used this setting, you may want to instead use annex.web-options
to pass options to curl.
With annex.web-download-command removed, nearly all url accesses in
git-annex are made via Utility.Url via http-client or curl. http-client
only supports http and https, so no problem there.
(Disabling one and not the other is not implemented.)
Used curl --proto to limit the allowed url schemes.
Note that this will cause git annex fsck --from web to mark files using
a disallowed url scheme as not being present in the web. That seems
acceptable; fsck --from web also does that when a web server is not available.
youtube-dl already disabled file: itself (probably for similar
reasons). The scheme check was also added to youtube-dl urls for
completeness, although that check won't catch any redirects it might
follow. But youtube-dl goes off and does its own thing with other
protocols anyway, so that's fine.
Special remotes that support other domain-specific url schemes are not
affected by this change. In the bittorrent remote, aria2c can still
download magnet: links. The download of the .torrent file is
otherwise now limited by annex.security.allowed-url-schemes.
This does not address any external special remotes that might download
an url themselves. Current thinking is all external special remotes will
need to be audited for this problem, although many of them will use
http libraries that only support http and not curl's menagarie.
The related problem of accessing private localhost and LAN urls is not
addressed by this commit.
This commit was sponsored by Brett Eisenberg on Patreon.
Prevent haskell http-client from decompressing gzip files, so downloads of
such files works the same as it used to with wget and curl.
Explicitly setting accept-encoding to "identity" is probably not needed,
but that's what wget sends (curl does not send the header), and since
http-client is trying to be excessively smart, it seems we need to set
hAcceptEncoding to something to prevent it from inserting its own,
and this seems better than some hack like "".
This commit was sponsored by Ole-Morten Duesund on Patreon.
* Display error message when http download fails.
There's nothing in the http-client library to nicely format a http
exception, so in some cases it has to fall back to using show on it.
Seems better than just saying "it failed" or only showing the http
status code.
* Avoid forward retry when 0 bytes were received.
forwardRetry was comparing Nothing to Just 0, and so thought there had
been progress made when 0 bytes were received.
This commit was supported by the NSF-funded DataLad project.
* For url downloads, git-annex now defaults to using a http library,
rather than wget or curl. But, if annex.web-options is set, it will
use curl. To use the .netrc file, run:
git config annex.web-options --netrc
* git-annex no longer uses wget (and wget is no longer shipped with
git-annex builds).
Note that curl is always run in silent mode, since the new API for
download has a MeterUpdate and doesn't make way for curl progress
output. It might be worth writing a parser for curl's progress output
to update the meter when using it, but I didn't bother with this edge
case for now.
This commit was supported by the NSF-funded DataLad project.