Check just before running update-index if the worktree file's content is
still the same, don't update it when it's been modified. This narrows
the race window a lot, from possibly minutes or hours, to seconds or
less.
(Use replaceFile so that the worktree update happens atomically,
allowing the InodeCache of the new worktree file to itself be gathered
w/o any other race.)
This doesn't eliminate the race; it can still occur in the window before
update-index runs. When annex.queue is large, a lot of files will be
statted by the checks, and so the window may still be large enough to be a
problem.
When only a few files are being processed, the window is as small as it
is in the race where a modification gets overwritten by git-annex when
it updates the worktree. Or maybe as small as whatever race git
checkout/pull/merge may have when the worktree gets modified during it.
Still, I've kept a todo about this race.
This commit was supported by the NSF-funded DataLad project.
Use git update-index --refresh, since it's a little bit more
efficient and the user can be told to run it if a locked index prevents
git-annex from running it.
This also fixes the problem where an annexed file was deleted in the index
and a get of another file that uses the same key caused the index update to
add back the deleted file. update-index will not add back the deleted file.
Documented in tips/unlocked_files.mdwn the gotcha that the index update
may conflict with other operations. I can't see any way to possibly avoid
that conflict.
One new todo about a race that causes a modification to be accidentially
staged.
Note that the assistant only flushes the git command queue when it
commits a modification. I have not tested the assistant with v6 unlocked
files, but assume most users of the assistant won't care if the index
shows a file as modified for a while.
This commit was supported by the NSF-funded DataLad project.
After updating the worktree for an add/drop, update git's index, so git
status will not show the files as modified.
What actually happens is that the index update removes the inode
information from the index. The next git status (or similar) run
then has to do some work. It runs the clean filter.
So, this depends on the clean filter being reasonably fast and on git
not leaking memory when running it. Both problems were fixed in
a96972015d, but only for git 2.5. Anyone
using an older git will see very expensive git status after an add/drop.
This uses the same git update-index queue as other parts of git-annex, so
the actual index update is fairly efficient. Of course, updating the index
does still have some overhead. The annex.queuesize config will control how
often the index gets updated when working on a lot of files.
This is an imperfect workaround... Added several todos about new
problems this workaround causes. Still, this seems a lot better than the
old behavior.
This commit was supported by the NSF-funded DataLad project.
v6 add: Take advantage of improved SIGPIPE handler in git 2.5 to speed up
the clean filter by not reading the file content from the pipe. This also
avoids git buffering the whole file content in memory.
When built with an older git, still consumes stdin. If built with a newer
git and used with an older one, it breaks, but that's acceptable --
checking the git version every time would make repeated smudge runs slow.
This commit was supported by the NSF-funded DataLad project.
When --batch is used with matching options like --in, --metadata, etc, only
operate on the provided files when they match those options. Otherwise, a
blank line is output in the batch protocol.
Affected commands: find, add, whereis, drop, copy, move, get
In the case of find, the documentation for --batch already said it honored
the matching options. The docs for the rest didn't, but it makes sense to
have them honor them. While this is a behavior change, why specify the
matching options with --batch if you didn't want them to apply?
Note that the batch output for all of the affected commands could
already output a blank line in other cases, so batch users should
already be prepared to deal with it.
git-annex metadata didn't seem worth making support the matching options,
since all it does is output metadata or set metadata, the use cases for
using it in combination with the martching options seem small. Made it
refuse to run when they're combined, leaving open the possibility for later
support if a use case develops.
This commit was sponsored by Brett Eisenberg on Patreon.
Added getStaged, to get the versions of git-annex branch files staged in its
index, and use during transitions so the result of merging sibling branches
is used.
The catFileStop in performTransitionsLocked is absolutely necessary,
without that the bug still occurred, because git cat-file was already
running and was looking at the old index file.
Note that getLocal still has cat-file look at the git-annex branch, not the
index. It might be faster if it looked at the index, but probably only
marginally so, and I've not benchmarked it to see if it's faster at all. I
didn't want to change unrelated behavior as part of this bug fix. And as
the need for catFileStop shows, using the index file has added
complications.
Anyway, it still seems fine for getLocal to look at the git-annex branch,
because normally the index file is updated just before the git-annex branch
is committed, and so they'll contain the same information. It's only during
a transition that the two diverge.
This commit was sponsored by Paul Walmsley in honor of Mark Phillips.
It was sorting by uuid, rather than cost!
Avoid future bugs of this kind by changing the Ord to primarily compare
by cost, with uuid only used when the cost is the same.
This commit was supported by the NSF-funded DataLad project.
Added annex.commitmessage config that can specify a commit message for the
git-annex branch instead of the usual "update".
This commit was supported by the NSF-funded DataLad project.
Useful for dropping old objects from cache repositories.
But also, quite a genrally useful thing to have..
Rather than imitiating find's -atime and other options, all of which are
pretty horrible to use, I made this match files accessed within a time
period, using the same duration format used by git-annex schedule and
--limit-time
In passing, changed the --limit-time option parser to parse the
duration, instead of having it later throw an error.
This commit was supported by the NSF-funded DataLad project.
Added remote.name.annex-speculate-present config that can be used to
make cache remotes.
Implemented it in Remote.keyPossibilities, which is used by the
get/move/copy/mirror commands, and nothing else. This way, things like
whereis will not show content that's speculatively present.
The assistant and sync --content were not using Remote.keyPossibilities,
and were changed to use it.
The efficiency hit should be small; Remote.keyPossibilities is only
used before transferring a file, which is the expensive operation.
And, it's only doing one lookup of the remoteList and a very cheap
filter over it.
Note that, git-annex still updates the location log when copying content
to a remote with annex-speculate-present set. In this case, the location
tracking will indicate that content is present in the remote. This may
not be wanted for caches, or may not be a real problem for them. TBD.
This commit was supported by the NSF-funded DataLad project.
Support working trees set up by git-worktree, by setting up some symlinks
such that git-annex links work right.
Also improved support for repositories created with --separate-git-dir.
At least recent git makes a .git file for those (older may have used a
symlink?), so that also needs to be converted to a symlink.
This commit was sponsored by Nick Piper on Patreon.
Work around git bug that runs smudge/clean filters at the top of the
repository while passing them a relative GIT_WORK_TREE that may point
outside of the repository, by using GIT_PREFIX to get back to the
subdirectory where a relative GIT_WORK_TREE is valid.
git devs have been informed of the bug and may fix it, which could conveivably
break this fix, but as it is, this works back to git 1.7.6.
This commit was sponsored by Jochen Bartl on Patreon.
Send User-Agent and any configured annex.http-headers when downloading with
http, fixes reversion introduced when switching to http-client.
This commit was sponsored by mo on Patreon.
Fix reversion introduced in version 6.20180316 that caused git-annex to
stop processing files when unable to contact a ssh remote.
The bug was not in any of the changed lines, but this one in inAnnex:
P2PHelper.checkpresent (Ssh.runProto rmt connpool (cantCheck rmt) fallback) key
cantCheck throws an exception, but that parameter to runProto expects a
value, which it returns. So, inAnnex is returning a Bool containing an
exception. This defeats the usual checks for checkPresent throwing an
exception, crashing git-annex.
Fixed by making runProto take an `Annex a` instead of an `a`, so
passing cantCheck to it doesn't nest exceptions.
This commit was sponsored by andrea rota.
5f0f063a7a documented it as being
configured automatically, but the code never did that. Rather than try
to hard-code whatever urls amazon uses for its buckets, it seems better
to ask the user to find the url and set it.
addurl: When security configuration prevents downloads with youtube-dl,
still check if the url is one that it supports, and fail downloading it,
instead of downloading the raw web page.
Leveraged the existing verification code by making it also check the
retrievalSecurityPolicy.
Also, prevented getViaTmp from running the download action at all when the
retrievalSecurityPolicy is going to prevent verifying and so storing it.
Added annex.security.allow-unverified-downloads. A per-remote version
would be nice to have too, but would need more plumbing, so KISS.
(Bill the Cat reference not too over the top I hope. The point is to
make this something the user reads the documentation for before using.)
A few calls to verifyKeyContent and getViaTmp, that don't
involve downloads from remotes, have RetrievalAllKeysSecure hard-coded.
It was also hard-coded for P2P.Annex and Command.RecvKey,
to match the values of the corresponding remotes.
A few things use retrieveKeyFile/retrieveKeyFileCheap without going
through getViaTmp.
* Command.Fsck when downloading content from a remote to verify it.
That content does not get into the annex, so this is ok.
* Command.AddUrl when using a remote to download an url; this is new
content being added, so this is ok.
This commit was sponsored by Fernando Jimenez on Patreon.
They're no worse than http certianly. And, the backport of these
security fixes has to deal with wget, which supports http https and ftp
and has no way to turn off individual schemes, so this will make that
easier.
Security fix!
* git-annex will refuse to download content from http servers on
localhost, or any private IP addresses, to prevent accidental
exposure of internal data. This can be overridden with the
annex.security.allowed-http-addresses setting.
* Since curl's interface does not have a way to prevent it from accessing
localhost or private IP addresses, curl defaults to not being used
for url downloads, even if annex.web-options enabled it before.
Only when annex.security.allowed-http-addresses=all will curl be used.
Since S3 and WebDav use the Manager, the same policies apply to them too.
youtube-dl is not handled yet, and a http proxy configuration can bypass
these checks too. Those cases are still TBD.
This commit was sponsored by Jeff Goeke-Smith on Patreon.
Security fix! Allowing any schemes, particularly file: and
possibly others like scp: allowed file exfiltration by anyone who had
write access to the git repository, since they could add an annexed file
using such an url, or using an url that redirected to such an url,
and wait for the victim to get it into their repository and send them a copy.
* Added annex.security.allowed-url-schemes setting, which defaults
to only allowing http and https URLs. Note especially that file:/
is no longer enabled by default.
* Removed annex.web-download-command, since its interface does not allow
supporting annex.security.allowed-url-schemes across redirects.
If you used this setting, you may want to instead use annex.web-options
to pass options to curl.
With annex.web-download-command removed, nearly all url accesses in
git-annex are made via Utility.Url via http-client or curl. http-client
only supports http and https, so no problem there.
(Disabling one and not the other is not implemented.)
Used curl --proto to limit the allowed url schemes.
Note that this will cause git annex fsck --from web to mark files using
a disallowed url scheme as not being present in the web. That seems
acceptable; fsck --from web also does that when a web server is not available.
youtube-dl already disabled file: itself (probably for similar
reasons). The scheme check was also added to youtube-dl urls for
completeness, although that check won't catch any redirects it might
follow. But youtube-dl goes off and does its own thing with other
protocols anyway, so that's fine.
Special remotes that support other domain-specific url schemes are not
affected by this change. In the bittorrent remote, aria2c can still
download magnet: links. The download of the .torrent file is
otherwise now limited by annex.security.allowed-url-schemes.
This does not address any external special remotes that might download
an url themselves. Current thinking is all external special remotes will
need to be audited for this problem, although many of them will use
http libraries that only support http and not curl's menagarie.
The related problem of accessing private localhost and LAN urls is not
addressed by this commit.
This commit was sponsored by Brett Eisenberg on Patreon.
Display error messages that come from git-annex-shell when the p2p protocol
is used, so that diskreserve messages, IO errors, etc from the remote side
are visible again.
Felt like it should perhaps use outputError, so --json-error-messages would
include these, but as an async IO action, it can't, and this would need
MessageState to be converted to a tvar. Anyway, when not using p2pstdio,
that's not done; nor is it done for stderr from external special remotes
or other commands, so punted on the idea for now.
This commit was sponsored by mo on Patreon.
I can't find any documentation of how long it should be. Hard to imagine
it being shorter than 4 characters though, so put that in as a conservative
lower bound.
This commit was sponsored by Nick Piper on Patreon.
External special remotes can now add info to `git annex info $remote`, by
replying to the GETINFO message.
Had to generalize some helpers to allow consuming multiple messages from
the remote.
The code added to Remote/* here is AGPL licensed, thus changed the license
of the files.
This commit was sponsored by Jake Vosloo on Patreon.
In keyUrls, the GitConfig is used only by annexLocations
to support configured Differences. Since such configurations affect all
clones of a repository, the local repo's GitConfig must have the same
information as the remote's GitConfig would have. So, used getGitConfig
to get the local GitConfig, which is cached and so available cheaply.
That actually fixed a bug noone had ever noticed: keyUrls is
used for remotes accessed over http. The full git config of such a
remote is normally not available, so the remoteGitConfig that keyUrls
used would not have the necessary information in it.
In copyFromRemoteCheap', it uses gitAnnexLocation,
which does need the GitConfig of the remote repo itself in order to
check if it's crippled, supports symlinks, etc. So, made the
State include that GitConfig, cached. The use of gitAnnexLocation is
within a (not $ Git.repoIsUrl repo) guard, so it's local, and so
its git config will always be read and available.
(Note that gitAnnexLocation in turn calls annexLocations, so the
Differences config it uses in this case comes from the remote repo's
GitConfig and not from the local repo's GitConfig. As explained above
this is ok since they must have the same value.)
Not very happy with this mess of different GitConfigs not type-safe and
some read only sometimes etc. Very hairy. Think I got it this change
right. Test suite passes..
This commit was sponsored by Ethan Aubin.
Unfortunately one more use remains..
This should be just as fast as the other method. The remote's Git.Repo
has already had its config read, so Annex.new's call to Git.Config.read
is a noop.
Thid commit was sponsored by andrea rota.
Fixed annex-checkuuid implementation, so that remotes configured that way
can be used. This was 100% broken from the first commit of it, oops.
This commit was sponsored by Øyvind Andersen Holm.
This is groundwork for letting a repo be instantiated the first time
it's actually used, instead of at startup.
The only behavior change is that some old special cases for xmpp remotes
were removed. Where before git-annex silently did nothing with those
no-longer supported remotes, it may now fail in some way.
The additional IO action should have no performance impact as long as
it's simply return.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon
Show operating system and repository version list when run outside
a git repo too.
Also made it only display the local repository version when in a git-annex
repo. Before it showed "unknown" when run in a git repo that was not
git-annex initialized. That seemed like confusing behavior.
This commit was sponsored by Jochen Bartl on Patreon.
https://prime.haskell.org/wiki/Libraries/Proposals/SemigroupMonoid
I am not happy with the fragile pile of CPP boilerplate required to support
ghc back to 7.0, which git-annex still targets for both the android build
and the standalone build targeting old linux kernels. It makes me unlikely
to want to use Semigroup more in git-annex, because the benefit of the
abstraction is swamped by the ugliness. I actually considered ripping out
all the Semigroup instances, but some are needed to use
optparse-applicative.
The problem, I think, is they made this transaction on too fast a timeline.
(Although ironically, work on it started in 2015 or earlier!)
In particular, Debian oldstable is not out of security support, and it's
not possible to follow the simpler workarounds documented on the wiki and
have it build on oldstable (because the semigroups package in it is too
old).
I have only tested this build with ghc 8.2.2, not the newer and older
versions that branches of the CPP support. So there could be typoes, we'll
see.
This commit was sponsored by Brock Spratlen on Patreon.
Makes it allow writes, but not deletion of annexed content. Note that
securing pushes to the git repository is left up to the user.
This commit was sponsored by Jack Hill on Patreon.
* migrate: Fix bug in migration between eg SHA256 and SHA256E,
that caused the extension to be included in SHA256 keys,
and omitted from SHA256E keys.
(Bug introduced in version 6.20170214)
* migrate: Check for above bug when migrating from SHA256 to SHA256
(and same for SHA1 to SHA1 etc), and remove the extension that should
not be in the SHA256 key.
* fsck: Detect and warn when keys need an upgrade, either to fix up
from the above migrate bug, or to add missing size information
(a long ago transition), or because of a few other past key related
bugs.
This commit was sponsored by Henrik Riomar on Patreon.
Prevent haskell http-client from decompressing gzip files, so downloads of
such files works the same as it used to with wget and curl.
Explicitly setting accept-encoding to "identity" is probably not needed,
but that's what wget sends (curl does not send the header), and since
http-client is trying to be excessively smart, it seems we need to set
hAcceptEncoding to something to prevent it from inserting its own,
and this seems better than some hack like "".
This commit was sponsored by Ole-Morten Duesund on Patreon.
* move: --force was accidentially enabling two unrelated behaviors
since 6.20180427. The older behavior, which has never been well
documented and seems almost entirely useless, has been removed.
* copy: --force no longer does anything.
This commit was sponsored by Øyvind Andersen Holm.