This fixes a reversion in the ByteString conversion. The old code used
isSpace to decide when the metadata value needs to be base64 encoded,
and that incorrectly changed to only checking if it contained ' '.
Note that only '\n' and '\r' were added and not other sorts of
whitespace that isSpace matches, like '\t' and '\v'. Only the former
would cause problems.
Purifying exportActions will allow introspecting and modifying it,
which is needed to add progress bar display to it.
Only S3 and WebDAV ran an Annex action while constructing ExportActions.
There was a small performance gain from them doing that, since a
resource was able to be prepared and reused for multiple actions by
Command.Export.
As seen in commit 809cfbbd8a and
5d394023eb S3 and WebDAV actually create a
new handle for each access in normal, non-export use. It doesn't seem
worth making export use of them marginally more efficient than normal
use. It would be better to do that work upfront when constructing the
remote. Or perhaps use a MVar to cache a handle.
This commit was sponsored by Nick Piper on Patreon.
* Switch to using .git/annex/othertmp for tmp files other than partial
downloads, and make stale files left in that directory when git-annex
is interrupted be cleaned up promptly by subsequent git-annex processes.
* The .git/annex/misctmp directory is no longer used and git-annex will
delete anything lingering in there after it's 1 week old.
Also, in Annex.Ingest, made the filename it uses in the tmp dir be
prefixed with "ingest-" to avoid potentially using a filename used by
some other code.
Adding that field broke the Read/Show serialization back-compat,
and also the Eq and Ord instances were not blinded to it, which broke
git annex fsck and probably more.
I think that the new approach used in formatKeyVariety will be nearly
as fast, but have not benchmarked it.
This reverts commit 4536c93bb2.
That broke Read/Show of a Key, and unfortunately Key is read in at least
one place; the GitAnnexDistribution data type.
It would be worth bringing this optimisation back, but it would need
either a custom Read/Show instance that preserves back-compat, or
wrapping Key in a data type that contains the serialization, or changing
how GitAnnexDistribution is serialized.
Also, the Eq instance would need to compare keys with and without a
cached seralization the same.
This will speed up the common case where a Key is deserialized from
disk, but is then serialized to build eg, the path to the annex object.
It means that every place a Key has any of its fields changed, the cache
has to be dropped. I've grepped and found them all. But, it would be
better to avoid that gotcha somehow..
What these generate is not really suitable to be used as a filename,
which is why keyFile and fileKey further escape it. These are just
serializing Keys.
Also removed a quickcheck test that was very unlikely to test anything
useful, since it relied on random chance creating something that looks
like a serialized key. The other test is sufficient for testing what
that was intended to test anyway.
The new parser is significantly stricter than the old one:
The old file2key allowed the fields to come in any order,
but the new one requires the fixed order that git-annex has always used.
Hopefully this will not cause any breakage.
And the old file2key allowed eg SHA1-m1-m2-m3-m4-m5-m6--xxxx
while the new does not allow duplication of fields. This could potentially
improve security, because allowing lots of extra junk like that in a key
could potentially be used in a SHA1 collision attack, although the current
attacks need binary data and not this kind of structured numeric data.
Speed improved of course, and fairly substantially, in microbenchmarks:
benchmarking old/key2file
time 2.264 μs (2.257 μs .. 2.273 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 2.265 μs (2.260 μs .. 2.275 μs)
std dev 21.17 ns (13.06 ns .. 39.26 ns)
benchmarking new/key2file'
time 1.744 μs (1.741 μs .. 1.747 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 1.745 μs (1.742 μs .. 1.751 μs)
std dev 13.55 ns (9.099 ns .. 21.89 ns)
benchmarking old/file2key
time 6.114 μs (6.102 μs .. 6.129 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 6.118 μs (6.106 μs .. 6.143 μs)
std dev 55.00 ns (30.08 ns .. 100.2 ns)
benchmarking new/file2key'
time 1.791 μs (1.782 μs .. 1.801 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 1.792 μs (1.785 μs .. 1.804 μs)
std dev 32.46 ns (20.59 ns .. 50.82 ns)
variance introduced by outliers: 19% (moderately inflated)
This preserves the workaround for the old bug that caused NoUUID items
to be stored in the log, prefixing log lines with " ". It's now handled
implicitly, by using takeWhile1 (/= ' ') to get the uuid.
There is a behavior change from the old parser, which split the value
into words and then recombined it. That meant that "foo bar" and "foo\tbar"
came out as "foo bar". That behavior was not documented, and seems
surprising; it meant that after a git-annex describe here "foo bar",
you wouldn't get that same string back out when git-annex displayed repo
descriptions.
Otoh, some other parsers relied on the old behavior, and the attoparsec
rewrites had to deal with the issue themselves...
For group.log, there are some edge cases around the user providing a
group name with a leading or trailing space. The old parser would ignore
such excess whitespace. The new parser does too, because the alternative
is to refuse to parse something like " group1 group2 " due to excess
whitespace, which would be even more confusing behavior.
The only git-annex branch log file that is not converted to attoparsec
and bytestring-builder now is transitions.log.
Mostly didn't push the ByteStrings down very deep, but all of these log
files are not written to frequently at all, so slight remaining
innefficiency doesn't matter.
In Logs.UUID, removed the fixBadUUID code that cleaned up after a bug in
git-annex versions 3.20111105-3.20111110. In the unlikely event that a repo was
last touched by that ancient git-annex version, the descriptions of remotes
would appear missing when used with this version of git-annex. That is such minor
breakage, and so unlikely to still be a problem for any repos, that it was not
worth forward-porting that code to ByteString.
MetaField was already limited to alphanumerics, so it makes sense to use
Text for it.
Note that technically a UUID can contain invalid UTF-8, and so
remoteMetaDataPrefix's use of T.pack . fromUUID could replace non-UTF8
values with '?' or whatever. In practice, a UUID is usually also text,
I only kept open the possibility of it containing invalid UTF-8 to avoid
breaking parsing of strange UUIDs in git-annex branch files. So, I
decided to let this edge case slip by.
Have not updated the rest of the code base yet for this change, as the
change took 2.5 hours longer than I expected to get working properly.
This should make == comparison of UUIDs somewhat faster, and perhaps a
few other operations around maps of UUIDs etc.
FromUUID/ToUUID are used to convert String, which is still used for all
IO of UUIDs. Eventually the hope is those instances can be removed,
and all git-annex branch log files etc use ByteString throughout, for a
real speed improvement.
Note the use of fromRawFilePath / toRawFilePath -- while a UUID usually
contains only alphanumerics and so could be treated as ascii, it's
conceivable that some git-annex repository has been initialized using
a UUID that is not only not a canonical UUID, but contains high unicode
or invalid unicode. Using the filesystem encoding avoids any problems
with such a thing. However, a NUL in a UUID seems extremely unlikely,
so I didn't use encodeBS / decodeBS to avoid their extra overhead in
handling NULs.
The Read/Show instance for UUID luckily serializes the same way for
ByteString as it did for String.
* findref: Support file matching options: --include, --exclude,
--want-get, --want-drop, --largerthan, --smallerthan, --accessedwithin
* Commands supporting --branch now apply file matching options --include,
--exclude, --want-get, --want-drop to filenames from the branch.
Previously, combining --branch with those would fail to match anything.
* add, import, findref: Support --time-limit.
This commit was sponsored by Jake Vosloo on Patreon.
Note that it does not prevent storing p2p access tokens or multicast
encryption keys, since those are not cached; the previous commit
established the distinction.
How well this works depends on how often getRemoteCredPair is called and
how expensive it is. In some cases setting this will result in an annoying
number of gpg password prompts and/or slowdowns due to reading creds
from the git-annex branch and decrypting, which could be improved by calling
getRemoteCredPair less often.
This commit was sponsored by Ilya Shlyakhter on Patreon.
Install new git hooks in this version.
This does beg the question of what to do if git later gets eg a
post-smudge hook, that could run git-annex smudge --update. I think the
thing to do in that case would be to make git-annex smudge --update
install the new hooks. That way, as the user uses git-annex, the hook
would be created pretty quickly and without needing any extra syscalls
except for when git-annex smudge --update is called.
I considered doing something like that for installation of the
post-checkout and post-merge hooks, which would have avoided the need
for v7. But the only place it was cheap to do it would be in git-annex smudge
which could cheaply notice that smudge.log didn't exist yet and so know
the hooks needed to be installed. But since smudge used to populate pointer
files, it would be quite surprising if a single git checkout/merge failed
to update the work tree, and so that idea didn't work out.
The other reason for v7 is psychological -- users don't need to worry
about whether they might be running an old version of git-annex that
doesn't support their v7 repository very well. And bug reports about
"v6" have gotten a bit of a bad association in my head since they often
hit one of the known limitations and didn't realize it was experimental.
newtyped RepoVersion Int to avoid needing 2 comparisons in
versionSupportsUnlockedPointers etc. Also it's just nicer.
This commit was sponsored by John Pellman on Patreon.
Both Command.Sync and Annex.Ingest had their own versions of this.
The one in Annex.Ingest used Git.Branch.currentUnsafe, but does not seem
to need it. That is only checking to see if it's in an adjusted unlocked
branch, and when in an adjusted branch, the branch does in fact exist,
so the added check that Git.Branch.current does is fine.
This commit was sponsored by Denis Dzyubenko on Patreon.
Running git-annex linux builds in termux seems to work well enough that the
only reason to keep the Android app would be to support Android 4-5, which
the old Android app supported, and which I don't know if the termux method
works on (although I see no reason why it would not).
According to [1], Android 4-5 remains on around 29% of devices, down from
51% one year ago.
[1] https://www.statista.com/statistics/271774/share-of-android-platforms-on-mobile-devices-with-android-os/
This is a rather large commit, but mostly very straightfoward removal of
android ifdefs and patches and associated cruft.
Also, removed support for building with very old ghc < 8.0.1, and with
yesod < 1.4.3, and without concurrent-output, which were only being used
by the cross build.
Some documentation specific to the Android app (screenshots etc) needs
to be updated still.
This commit was sponsored by Brett Eisenberg on Patreon.
Added annex.jobs setting, which is like using the -J option.
Of course, -J overrides annex.jobs.
This commit was sponsored by Trenton Cronholm on Patreon.
Added remote.name.annex-security-allow-unverified-downloads, a per-remote
setting for annex.security.allow-unverified-downloads.
This commit was sponsored by Brock Spratlen on Patreon.
Added annex.maxextensionlength for use cases where extensions longer than 4
characters are needed.
This commit was sponsored by Henrik Riomar on Patreon.
Since the same key can be stored in a versioned S3 bucket multiple times
with different version IDs, this allows tracking them all. Not currently
needed, but if we ever want to drop from a versioned S3 bucket, we'll
need to know them all.
This commit was supported by the NSF-funded DataLad project.
Actually very straightforward reuse of the metadata log file code.
Although I had to add a todo item as git-annex forget won't clean up
dead remote's metadata yet.
This would be worth adding to the external special remote interface
sometime. Have not opened a todo though, guess I'll wait until something
needs it.
This commit was supported by the NSF-funded DataLad project.
Make `git annex export` check appendonly when removing a file from an
export, and not update the location log, since the remote still contains
the content.
This commit was supported by the NSF-funded DataLad project.
Does nothing yet.
Considered making bup readonly, but while the content can't be removed,
it is able to delete a branch, so didn't.
This commit was supported by the NSF-funded DataLad project.
It was sorting by uuid, rather than cost!
Avoid future bugs of this kind by changing the Ord to primarily compare
by cost, with uuid only used when the cost is the same.
This commit was supported by the NSF-funded DataLad project.
Added annex.commitmessage config that can specify a commit message for the
git-annex branch instead of the usual "update".
This commit was supported by the NSF-funded DataLad project.
Added remote.name.annex-speculate-present config that can be used to
make cache remotes.
Implemented it in Remote.keyPossibilities, which is used by the
get/move/copy/mirror commands, and nothing else. This way, things like
whereis will not show content that's speculatively present.
The assistant and sync --content were not using Remote.keyPossibilities,
and were changed to use it.
The efficiency hit should be small; Remote.keyPossibilities is only
used before transferring a file, which is the expensive operation.
And, it's only doing one lookup of the remoteList and a very cheap
filter over it.
Note that, git-annex still updates the location log when copying content
to a remote with annex-speculate-present set. In this case, the location
tracking will indicate that content is present in the remote. This may
not be wanted for caches, or may not be a real problem for them. TBD.
This commit was supported by the NSF-funded DataLad project.
Leveraged the existing verification code by making it also check the
retrievalSecurityPolicy.
Also, prevented getViaTmp from running the download action at all when the
retrievalSecurityPolicy is going to prevent verifying and so storing it.
Added annex.security.allow-unverified-downloads. A per-remote version
would be nice to have too, but would need more plumbing, so KISS.
(Bill the Cat reference not too over the top I hope. The point is to
make this something the user reads the documentation for before using.)
A few calls to verifyKeyContent and getViaTmp, that don't
involve downloads from remotes, have RetrievalAllKeysSecure hard-coded.
It was also hard-coded for P2P.Annex and Command.RecvKey,
to match the values of the corresponding remotes.
A few things use retrieveKeyFile/retrieveKeyFileCheap without going
through getViaTmp.
* Command.Fsck when downloading content from a remote to verify it.
That content does not get into the annex, so this is ok.
* Command.AddUrl when using a remote to download an url; this is new
content being added, so this is ok.
This commit was sponsored by Fernando Jimenez on Patreon.
This will be used to protect against CVE-2018-10859, where an encrypted
special remote is fed the wrong encrypted data, and so tricked into
decrypting something that the user encrypted with their gpg key and did
not store in git-annex.
It also protects against CVE-2018-10857, where a remote follows a http
redirect to a file:// url or to a local private web server. While that's
already been prevented in git-annex's own use of http, external special
remotes, hooks, etc use other http implementations and could still be
vulnerable.
The policy is not yet enforced, this commit only adds the appropriate
metadata to remotes.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
They're no worse than http certianly. And, the backport of these
security fixes has to deal with wget, which supports http https and ftp
and has no way to turn off individual schemes, so this will make that
easier.
Security fix!
* git-annex will refuse to download content from http servers on
localhost, or any private IP addresses, to prevent accidental
exposure of internal data. This can be overridden with the
annex.security.allowed-http-addresses setting.
* Since curl's interface does not have a way to prevent it from accessing
localhost or private IP addresses, curl defaults to not being used
for url downloads, even if annex.web-options enabled it before.
Only when annex.security.allowed-http-addresses=all will curl be used.
Since S3 and WebDav use the Manager, the same policies apply to them too.
youtube-dl is not handled yet, and a http proxy configuration can bypass
these checks too. Those cases are still TBD.
This commit was sponsored by Jeff Goeke-Smith on Patreon.
Security fix! Allowing any schemes, particularly file: and
possibly others like scp: allowed file exfiltration by anyone who had
write access to the git repository, since they could add an annexed file
using such an url, or using an url that redirected to such an url,
and wait for the victim to get it into their repository and send them a copy.
* Added annex.security.allowed-url-schemes setting, which defaults
to only allowing http and https URLs. Note especially that file:/
is no longer enabled by default.
* Removed annex.web-download-command, since its interface does not allow
supporting annex.security.allowed-url-schemes across redirects.
If you used this setting, you may want to instead use annex.web-options
to pass options to curl.
With annex.web-download-command removed, nearly all url accesses in
git-annex are made via Utility.Url via http-client or curl. http-client
only supports http and https, so no problem there.
(Disabling one and not the other is not implemented.)
Used curl --proto to limit the allowed url schemes.
Note that this will cause git annex fsck --from web to mark files using
a disallowed url scheme as not being present in the web. That seems
acceptable; fsck --from web also does that when a web server is not available.
youtube-dl already disabled file: itself (probably for similar
reasons). The scheme check was also added to youtube-dl urls for
completeness, although that check won't catch any redirects it might
follow. But youtube-dl goes off and does its own thing with other
protocols anyway, so that's fine.
Special remotes that support other domain-specific url schemes are not
affected by this change. In the bittorrent remote, aria2c can still
download magnet: links. The download of the .torrent file is
otherwise now limited by annex.security.allowed-url-schemes.
This does not address any external special remotes that might download
an url themselves. Current thinking is all external special remotes will
need to be audited for this problem, although many of them will use
http libraries that only support http and not curl's menagarie.
The related problem of accessing private localhost and LAN urls is not
addressed by this commit.
This commit was sponsored by Brett Eisenberg on Patreon.
In keyUrls, the GitConfig is used only by annexLocations
to support configured Differences. Since such configurations affect all
clones of a repository, the local repo's GitConfig must have the same
information as the remote's GitConfig would have. So, used getGitConfig
to get the local GitConfig, which is cached and so available cheaply.
That actually fixed a bug noone had ever noticed: keyUrls is
used for remotes accessed over http. The full git config of such a
remote is normally not available, so the remoteGitConfig that keyUrls
used would not have the necessary information in it.
In copyFromRemoteCheap', it uses gitAnnexLocation,
which does need the GitConfig of the remote repo itself in order to
check if it's crippled, supports symlinks, etc. So, made the
State include that GitConfig, cached. The use of gitAnnexLocation is
within a (not $ Git.repoIsUrl repo) guard, so it's local, and so
its git config will always be read and available.
(Note that gitAnnexLocation in turn calls annexLocations, so the
Differences config it uses in this case comes from the remote repo's
GitConfig and not from the local repo's GitConfig. As explained above
this is ok since they must have the same value.)
Not very happy with this mess of different GitConfigs not type-safe and
some read only sometimes etc. Very hairy. Think I got it this change
right. Test suite passes..
This commit was sponsored by Ethan Aubin.
This is groundwork for letting a repo be instantiated the first time
it's actually used, instead of at startup.
The only behavior change is that some old special cases for xmpp remotes
were removed. Where before git-annex silently did nothing with those
no-longer supported remotes, it may now fail in some way.
The additional IO action should have no performance impact as long as
it's simply return.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon