MetaField was already limited to alphanumerics, so it makes sense to use
Text for it.
Note that technically a UUID can contain invalid UTF-8, and so
remoteMetaDataPrefix's use of T.pack . fromUUID could replace non-UTF8
values with '?' or whatever. In practice, a UUID is usually also text,
I only kept open the possibility of it containing invalid UTF-8 to avoid
breaking parsing of strange UUIDs in git-annex branch files. So, I
decided to let this edge case slip by.
Have not updated the rest of the code base yet for this change, as the
change took 2.5 hours longer than I expected to get working properly.
This is not as efficient as using ByteStrings throughout, but converting
the String to ByteString is actually significantly faster than the old
parser.
benchmarking parse/old
time 9.657 μs (9.600 μs .. 9.732 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 9.703 μs (9.645 μs .. 9.785 μs)
std dev 231.6 ns (161.5 ns .. 323.7 ns)
variance introduced by outliers: 25% (moderately inflated)
benchmarking parse/new
time 834.6 ns (797.1 ns .. 886.9 ns)
0.987 R² (0.976 R² .. 0.999 R²)
mean 816.4 ns (802.7 ns .. 845.1 ns)
std dev 62.39 ns (37.66 ns .. 108.4 ns)
variance introduced by outliers: 82% (severely inflated)
There is a small behavior change from the old parsePOSIXTime,
which accepted any amount of trailing whitespace after the timestamp.
That behavior was not documented, and it doesn't seem anything relied on it.
ghc warned that the guards did not cover all values of h, but they
clearly do, and when rewritten as a case statement the warning goes
away.
Probably a ghc bug, but I kind of prefer the case statement over the
guards anyway.
Deleting directories is one of the great unsolved problems of CS, thanks to
abominations like NFS lock files and Windows and races with other processes
cleaning up after themselves in the background. The gpg test harness
sometimes failed to delete its temp directory on NFS. Avoid the problem
class by not deleting it at all, and putting it inside the tmp repo being
tested. The test suite's more robust (and/or nonsensical) workarounds for
deleting its test dir will thus be used, hopefully avoiding the problem
until an OS finds a new way to violate POSIX and the laws of nature.
Note that this means that the .gnupg directory will be on whatever
filesystem the test suite is being run on, which may be a lesser quality
filesystem than gpg is really expecting. Gpg does not seem to need to
write sockets etc to there so this seems ok. The only known problem is
that if the filesystem forces a directory mode like 777, gpg will warn
about unsafe home directory perms, but it still works.
Don't much like that there's no way to distinguish between having the whole
content and having an old version of the file that's bigger, but of course
resuming a http transfer can always yield the wrong result if the file on
the http server is changing, and git-annex will detect that when it
verifies the downloaded content.
This work is supported by the NIH-funded NICEMAN (ReproNim TR&D3) project.
Cache high-resolution mtimes for improved detection of modified files in v7
(and direct mode).
Including on Windows.
With back-compat support so old low-res mtimes won't break anything, and
so the new information also won't break old versions of git-annex.
Luckily, this did not affect any git-annex log files, since they all
include the trailing 's' for backwards compatability reasons.
But, if I later want to drop that, this is the first commit where
git-annex can be trusted to parse that right.
The misparse caused it to be off by up to 10 seconds.
I've seen intermittent failures of the test suite with v6 for a long time,
it seems to have possibly gotten worse with the changes around v7. Or just
being unlucky; all tests failed today.
Seen on amd64 and i386 builders, repeatedly but intermittently:
unused: FAIL (4.86s)
Test.hs:928:
git diff did not show changes to unlocked file
And I think other such failures, all involving v7/v6 mode tests.
I managed to reproduce the unused failure with --keep-failures,
and inside the repo, git diff was indeed not showing any changes for
the modified unlocked file.
The two stats will be the same other than mtime; the old and new files have
the same size and inode, since the test case writes to the file and then
overwrites it.
Indeed, notice the identical timestamps:
builder@orca:~/gitbuilder/build/.t/tmprepo335$ echo 1 > foo; stat foo; echo 2 > foo; stat foo
File: foo
Size: 2 Blocks: 8 IO Block: 4096 regular file
Device: 801h/2049d Inode: 3546179 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 1000/ builder) Gid: ( 1000/ builder)
Access: 2018-10-29 22:14:10.894942036 +0000
Modify: 2018-10-29 22:14:10.894942036 +0000
Change: 2018-10-29 22:14:10.894942036 +0000
Birth: -
File: foo
Size: 2 Blocks: 8 IO Block: 4096 regular file
Device: 801h/2049d Inode: 3546179 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 1000/ builder) Gid: ( 1000/ builder)
Access: 2018-10-29 22:14:10.894942036 +0000
Modify: 2018-10-29 22:14:10.898942036 +0000
Change: 2018-10-29 22:14:10.898942036 +0000
Birth: -
I'm seeing this in Linux VMs; it doesn't happen on my laptop. I've also
not experienced the intermittent test suite failures on my laptop.
So, I hope that this small delay will avoid the problem.
Update: I didn't, indeed I then reproduced the same failure on my
laptop, so it must be due to something else. But keeping this change anyway
since not needing to worry about lowish-resolution mtime in the test suite seems
worthwhile.
Install new git hooks in this version.
This does beg the question of what to do if git later gets eg a
post-smudge hook, that could run git-annex smudge --update. I think the
thing to do in that case would be to make git-annex smudge --update
install the new hooks. That way, as the user uses git-annex, the hook
would be created pretty quickly and without needing any extra syscalls
except for when git-annex smudge --update is called.
I considered doing something like that for installation of the
post-checkout and post-merge hooks, which would have avoided the need
for v7. But the only place it was cheap to do it would be in git-annex smudge
which could cheaply notice that smudge.log didn't exist yet and so know
the hooks needed to be installed. But since smudge used to populate pointer
files, it would be quite surprising if a single git checkout/merge failed
to update the work tree, and so that idea didn't work out.
The other reason for v7 is psychological -- users don't need to worry
about whether they might be running an old version of git-annex that
doesn't support their v7 repository very well. And bug reports about
"v6" have gotten a bit of a bad association in my head since they often
hit one of the known limitations and didn't realize it was experimental.
newtyped RepoVersion Int to avoid needing 2 comparisons in
versionSupportsUnlockedPointers etc. Also it's just nicer.
This commit was sponsored by John Pellman on Patreon.
Running git-annex linux builds in termux seems to work well enough that the
only reason to keep the Android app would be to support Android 4-5, which
the old Android app supported, and which I don't know if the termux method
works on (although I see no reason why it would not).
According to [1], Android 4-5 remains on around 29% of devices, down from
51% one year ago.
[1] https://www.statista.com/statistics/271774/share-of-android-platforms-on-mobile-devices-with-android-os/
This is a rather large commit, but mostly very straightfoward removal of
android ifdefs and patches and associated cruft.
Also, removed support for building with very old ghc < 8.0.1, and with
yesod < 1.4.3, and without concurrent-output, which were only being used
by the cross build.
Some documentation specific to the Android app (screenshots etc) needs
to be updated still.
This commit was sponsored by Brett Eisenberg on Patreon.
The error message displayed used to only come from curl/wget and perhaps
was clearer than the one displayed now that http-client is used. In any
case, it does make sense to hide it because git-annex prints its own
warning message.
This commit was sponsored by Jake Vosloo on Patreon.
download is documented as displaying an error when download fails, but
it didn't when the url was not valid at all. That leads to confusing
behavior.
Also, display the url with --debug
Untested, on FreeBSD but enough to fix the listed build errors.
Seems that System.Posix.Files must have used to export this stuff and it
was split.
This commit was sponsored by Peter on Patreon.
Work around git cat-file --batch's protocol not supporting newlines by
running git cat-file not batched and passing the filename as a
parameter.
Of course this is quite a lot less efficient, especially because it
currently runs it multiple times to query for different pieces of
information.
Also, it has subtly different behavior when the batch process was
started and then some changes were made, in which case the batch process
sees the old index but this workaround sees the current index. Since
that batch behavior is mostly a problem that affects the assistant and has
to be worked around in it, I think I can get away with this difference.
I don't know of any other problems with newlines in filenames, everything
else in git I can think of supports -z. And git-annex's json output
supports newlines in filenames so downstream parsers from git-annex will be ok.
git-annex commands that use --batch themselves don't support newlines
in input filenames; using --json --batch is currently a way around that
problem.
This commit was sponsored by Ewen McNeill on Patreon.
When git-annex used wget and curl, --debug would show urls. So there can't
be any new security problem with doing so.
This commit was sponsored by John Pellman on Patreon.
Avoids "git-annex-shell: <stdin>: hGetChar: end of file"
being displayed by the test suite, due to the way it
runs git-annex-shell without using ssh.
git-annex-shell over ssh was not affected because git-annex hangs up the
ssh connection and so never sees the error message that git-annnex-shell
probably did emit.
This commit was sponsored by Ryan Newton on Patreon.
In 2013, I wrote "Cryptohash benchmarks 90 to 101% faster than external
hashers". Re-benchmarking today, I found cryptonite's sha256 consistently
outperformed coreutils by 10% for large files. Tested 10 mb, 100 mb, 1 gb
files with both sha256 and sha512. And for smaller files, the external
process startup time swamps the hash time.
Perhaps cryptonite has improved. Or it could just do better on my
current CPU Intel(R) Pentium(R) CPU 4410Y @ 1.50GHz). Anyway, even if cryptonite
is slower in some situations, seems likely it would only be marginally slower;
it's got the same class of highly optimised C code under the hood as coreutils.
The main difference between the two sha256 implementations seems to be
how much of the inner loop they unroll..
This commit was sponsored by Henrik Riomar on Patreon.
Switched code to use a for loop to avoid a filterM that would have
doubled the memory used.
This commit was supported by the NSF-funded DataLad project.
Switch to using http-client for large file downloads caused the reversion;
the code for displaying a 404 response was instead displaying the raw html
document, which is not useful.
This commit was sponsored by Ryan Newton on Patreon.
Send User-Agent and any configured annex.http-headers when downloading with
http, fixes reversion introduced when switching to http-client.
This commit was sponsored by mo on Patreon.
This avoid a build problem when different versions of posix and
posixcompat are used. Does not normally happen as cabal prevents that,
but this is sometimes used with ghc --make which can get into that
situation.
p2p --pair: Fix interception of the magic-wormhole pairing code, which
since 0.8.2 it has sent to stderr rather than stdout.
This is highly annoying because I had asked the magic wormhole developers
for a machine-readable way to get the data, and instead they changed how
the data was output, and didn't even mention this in my issue, or in the
changelog.
Seems this needs to be tested periodically to make sure it's still working.
This commit was sponsored by Ethan Aubin.
They're no worse than http certianly. And, the backport of these
security fixes has to deal with wget, which supports http https and ftp
and has no way to turn off individual schemes, so this will make that
easier.
A local http proxy would bypass the security configuration. So,
the security configuration has to be applied when choosing whether to
use the proxy.
While http rebinding attacks against the dns lookup of the proxy IP
address seem very unlikely, this implementation does prevent them, since
it resolves the IP address once, checks it, and then reconfigures
http-client's proxy using the resolved address.
This commit was sponsored by Ole-Morten Duesund on Patreon.
Security fix!
* git-annex will refuse to download content from http servers on
localhost, or any private IP addresses, to prevent accidental
exposure of internal data. This can be overridden with the
annex.security.allowed-http-addresses setting.
* Since curl's interface does not have a way to prevent it from accessing
localhost or private IP addresses, curl defaults to not being used
for url downloads, even if annex.web-options enabled it before.
Only when annex.security.allowed-http-addresses=all will curl be used.
Since S3 and WebDav use the Manager, the same policies apply to them too.
youtube-dl is not handled yet, and a http proxy configuration can bypass
these checks too. Those cases are still TBD.
This commit was sponsored by Jeff Goeke-Smith on Patreon.
Would be nice to add CIDR notation to this, but this is the minimal
thing needed for the security fix.
This commit was sponsored by Ewen McNeill on Patreon.
For use in a security boundary enforcement.
Based on https://en.wikipedia.org/wiki/Reserved_IP_addresses
Including supporting IPv4 addresses embedded in IPv6 addresses. Because
while RFC6052 3.1 says "Address translators MUST NOT translate packets
in which an address is composed of the Well-Known Prefix and a non-
global IPv4 address; they MUST drop these packets", I don't want to
trust that implementations get that right when enforcing a security
boundary.
This commit was sponsored by John Pellman on Patreon.
This is a clean way to add IP address restrictions to http-client, and
any library using it.
See https://github.com/snoyberg/http-client/issues/354#issuecomment-397830259
Some code from http-client and http-client-tls was copied in and
modified. Credited its author accordingly, and used the same MIT license.
The restrictions don't apply to http proxies. If using http proxies is a
problem, http-client already has a way to disable them.
SOCKS support is not included. As far as I can tell, http-client-tls
does not support SOCKS by default, and so git-annex never has.
The additional dependencies are free; git-annex already transitively
depended on them via http-conduit.
This commit was sponsored by Eric Drechsel on Patreon.
Security fix! Allowing any schemes, particularly file: and
possibly others like scp: allowed file exfiltration by anyone who had
write access to the git repository, since they could add an annexed file
using such an url, or using an url that redirected to such an url,
and wait for the victim to get it into their repository and send them a copy.
* Added annex.security.allowed-url-schemes setting, which defaults
to only allowing http and https URLs. Note especially that file:/
is no longer enabled by default.
* Removed annex.web-download-command, since its interface does not allow
supporting annex.security.allowed-url-schemes across redirects.
If you used this setting, you may want to instead use annex.web-options
to pass options to curl.
With annex.web-download-command removed, nearly all url accesses in
git-annex are made via Utility.Url via http-client or curl. http-client
only supports http and https, so no problem there.
(Disabling one and not the other is not implemented.)
Used curl --proto to limit the allowed url schemes.
Note that this will cause git annex fsck --from web to mark files using
a disallowed url scheme as not being present in the web. That seems
acceptable; fsck --from web also does that when a web server is not available.
youtube-dl already disabled file: itself (probably for similar
reasons). The scheme check was also added to youtube-dl urls for
completeness, although that check won't catch any redirects it might
follow. But youtube-dl goes off and does its own thing with other
protocols anyway, so that's fine.
Special remotes that support other domain-specific url schemes are not
affected by this change. In the bittorrent remote, aria2c can still
download magnet: links. The download of the .torrent file is
otherwise now limited by annex.security.allowed-url-schemes.
This does not address any external special remotes that might download
an url themselves. Current thinking is all external special remotes will
need to be audited for this problem, although many of them will use
http libraries that only support http and not curl's menagarie.
The related problem of accessing private localhost and LAN urls is not
addressed by this commit.
This commit was sponsored by Brett Eisenberg on Patreon.
Prevent haskell http-client from decompressing gzip files, so downloads of
such files works the same as it used to with wget and curl.
Explicitly setting accept-encoding to "identity" is probably not needed,
but that's what wget sends (curl does not send the header), and since
http-client is trying to be excessively smart, it seems we need to set
hAcceptEncoding to something to prevent it from inserting its own,
and this seems better than some hack like "".
This commit was sponsored by Ole-Morten Duesund on Patreon.
* Display error message when http download fails.
There's nothing in the http-client library to nicely format a http
exception, so in some cases it has to fall back to using show on it.
Seems better than just saying "it failed" or only showing the http
status code.
* Avoid forward retry when 0 bytes were received.
forwardRetry was comparing Nothing to Just 0, and so thought there had
been progress made when 0 bytes were received.
This commit was supported by the NSF-funded DataLad project.
Switch to Data.Map.Strict everywhere that used it.
There are still lots of lazy maps in git-annex. I think switching these
is safe. The risk is that there might be a map that is used in a way
that relies on the values not being evaluated to WHNF, and switching to
strict might result in bad performance or memory use. So, I have not
switched everything.
As long as all code imports Utility.Aeson rather than Data.Aeson,
and no Strings that may contain utf-8 characters are used for eg, object
keys via T.pack, this is guaranteed to fix the problem everywhere that
git-annex generates json.
It's kind of annoying to need to wrap ToJSON with a ToJSON', especially
since every data type that has a ToJSON instance has to be ported over.
However, that only took 50 lines of code, which is worth it to ensure full
coverage. I initially tried an alternative approach of a newtype FileEncoded,
which had to be used everywhere a String was fed into aeson, and chasing
down all the sites would have been far too hard. Did consider creating an
intentionally overlapping instance ToJSON String, and letting ghc fail
to build anything that passed in a String, but am not sure that wouldn't
pollute some library that git-annex depends on that happens to use ToJSON
String internally.
This commit was supported by the NSF-funded DataLad project.
* For url downloads, git-annex now defaults to using a http library,
rather than wget or curl. But, if annex.web-options is set, it will
use curl. To use the .netrc file, run:
git config annex.web-options --netrc
* git-annex no longer uses wget (and wget is no longer shipped with
git-annex builds).
Note that curl is always run in silent mode, since the new API for
download has a MeterUpdate and doesn't make way for curl progress
output. It might be worth writing a parser for curl's progress output
to update the meter when using it, but I didn't bother with this edge
case for now.
This commit was supported by the NSF-funded DataLad project.
Remote.S3 and Remote.Helper.Http both had similar code to sink a
http-conduit Response to a file; refactor out sinkResponseFile.
downloadC downloads an url to a file using http-conduit, and supports
resuming. Falls back to curl to handle urls that http-conduit does not
support. This is not used yet, but the goal is to replace download with
it.
git-annex.cabal: conduit-extra was not actually used for a long time,
remove the dep. conduit moves into the main dependency list, but since
http-conduit was already in there, and it depends on conduit, that's not
really adding a new build dep.
This commit was supported by the NSF-funded DataLad project.
Enable HTTP connection reuse across multiple files, when git-annex
uses http-conduit. Before, a new Manager was created each time
Utility.Url used it. Now, a single Manager gets created the first time,
so connections are reused.
Doesn't help when external programs are used for url download,
but does speed up addurl --fast, fsck --from web, etc.
Testing fsck --fast --from web with 3 files, over high-latency
satellite internet, it sped up from 19.37s to 14.96s.
This commit was supported by the NSF-funded DataLad project.
The pipe's FDs got inherited by ssh and it did something that kept them
open even once it exited. Probably involving passing them on to the ssh
mux daemon.
Set close on exec, and all is well.
Kept Annex.Ssh not using processTranscript even though it no longer
hangs when it does use it, just because processTranscript is overkill
there.
This commit was supported by the NSF-funded DataLad project.
This is much clearer to follow.
I've tested this, and it still has the problem described in
doc/bugs/occasional_hang_with_p2pstdio.mdwn
Which I think indicates that problem is not with my code, but something
else. ghc runtime? Something crazy ssh does in this situation? Unsure..
Some blake hash varieties were not yet available in that version.
Rather than tracking exact details of what cryptonite supported when,
disable blake unless using a current cryptonite.
There are a lot of different variants and sizes, I suppose we might as well
export all the common ones.
Bump dep to cryptonite to 0.16, earlier versions lacked BLAKE2 support.
Even android has 0.16 or newer.
On Debian, Blake2bp_512 is buggy, so I have omitted it for now.
http://bugs.debian.org/892855
This commit was sponsored by andrea rota.
Noticed that getting a key whose size is not known resulted in a
progress display that didn't include the percent complete.
Fixed for P2P by making the size sent with DATA be used to update the
meter's total size.
In order for rateLimitMeterUpdate to also learn the total size,
had to make it be passed the Meter, and some other reorg in
Utility.Metered was also done so that --json-progress can construct a
Meter to pass to rateLimitMeterUpdate.
When the fallback rsync is done, the progress display still doesn't
include the percent complete. Only way to fix that seems to be to let rsync
display its output again, but that would conflict with git-annex's
own progress meter, which is also being displayed.
This commit was sponsored by Henrik Riomar on Patreon.