Also fix support for operating on multiple pairs of files and keys.
Moved notAnnexed to inside starting, so error message will get into the json.
Cannot include the key in the starting as it's not known yet, so instead
add it to the json later.
Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project
Converted warning and similar to use StringContainingQuotedPath. Most
warnings are static strings, some do refer to filepaths that need to be
quoted, and others don't need quoting.
Note that, since quote filters out control characters of even
UnquotedString, this makes all warnings safe, even when an attacker
sneaks in a control character in some other way.
When json is being output, no quoting is done, since json gets its own
quoting.
This does, as a side effect, make warning messages in json output not
be indented. The indentation is only needed to offset warning messages
underneath the display of the file they apply to, so that's ok.
Sponsored-by: Brett Eisenberg on Patreon
giveup changed to filter out control characters. (It is too low level to
make it use StringContainingQuotedPath.)
error still does not, but it should only be used for internal errors,
where the message is not attacker-controlled.
Changed a lot of existing error to giveup when it is not strictly an
internal error.
Of course, other exceptions can still be thrown, either by code in
git-annex, or a library, that include some attacker-controlled value.
This does not guard against those.
Sponsored-by: Noam Kremen on Patreon
When the filenames are part of the git repository or other files that
might have attacker-controlled names, quote them in error messages.
This is fairly complete, although I didn't do the one in
Utility.DirWatcher.INotify.hs because that doesn't have access to
Git.Filename or Annex.
But it's also quite possible I missed some. And also while scanning for
these, I found giveup used with other things that could be attacker
controlled to contain control characters (eg Keys). So, I'm thinking
it would also be good for giveup to just filter out control characters.
This commit is then not the only line of defence, but just good
formatting when git-annex displays a filename in an error message.
Sponsored-by: Kevin Mueller on Patreon
Added StringContainingQuotedPath, which is used for ActionItemOther.
In the process, checked every ActionItemOther for those containing
filenames, and made them use quoting.
Sponsored-by: Graham Spencer on Patreon
The temporary URL key used for the download, before the real key is
generated, was blocked by annex.securehashesonly.
Fixed by passing the Backend that will be used for the final key into
runTransfer. When a Backend is provided, have preCheckSecureHashes
check that, rather than the key being transferred.
Sponsored-by: unqueued on Patreon
--backend is no longer a global option, and is only accepted by commands
that actually need it.
Three commands that used to support backend but don't any longer are
watch, webapp, and assistant. It would be possible to make them support it,
but I doubt anyone used the option with these. And in the case of webapp
and assistant, the option was handled inconsistently, only taking affect
when the command is run with an existing git-annex repo, not when it
creates a new one.
Also, renamed GlobalOption etc to AnnexOption. Because there are many
options of this type that are not actually global (any more) and get
added to commands that need them.
Sponsored-by: Kevin Mueller on Patreon
This avoids it calling enteringStage VerifyStage when it's used in
places that only fall back to verification rarely, and which might be
called while in TransferStage and be going to perform a transfer after
the verification.
Added annex.adjustedbranchrefresh git config to update adjusted branches
set up by git-annex adjust --unlock-present/--hide-missing.
Note, in a few cases, I was not able to make the adjusted branch
be updated in calls to moveAnnex, because information about what
file corresponds to a key is not available. They are:
* If two files point to one file, then eg, `git annex get foo` will
update the branch to unlock foo, but will not unlock bar, because it
does not know about it. Might be fixable by making `git annex get
bar` do something besides skipping bar?
* git-annex-shell recvkey likewise (so sends over ssh from old versions
of git-annex)
* git-annex setkey
* git-annex transferkey if the user does not use --file
* git-annex multicast sends keys with no associated file info
Doing a single full refresh at the end, after any incremental refresh,
will deal with those edge cases.
Lots of nice wins from this in avoiding unncessary work, and I think
nothing got slower.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
No behavior changes (hopefully), just adding SeekInput and plumbing it
through to the JSON display code for later use.
Over the course of 2 grueling days.
withFilesNotInGit reimplemented in terms of seekHelper
should be the only possible behavior change. It seems to test as
behaving the same.
Note that seekHelper dummies up the SeekInput in the case where
segmentPaths' gives up on sorting the expanded paths because there are
too many input paths. When SeekInput later gets exposed as a json field,
that will result in it being a little bit wrong in the case where
100 or more paths are passed to a git-annex command. I think this is a
subtle enough problem to not matter. If it does turn out to be a
problem, fixing it would require splitting up the input
parameters into groups of < 100, which would make git ls-files run
perhaps more than is necessary. May want to revisit this, because that
fix seems fairly low-impact.
retrieveExport is part of ongoing transition to make remote methods
throw exceptions, rather than silently hide them.
getKey very rarely fails, and when it does it's always for the same reason
(user configured annex.backend to url for some reason). So, this will
avoid dealing with Nothing everywhere it's used.
This commit was sponsored by Ilya Shlyakhter on Patreon.
ifAnnexed in a bare repo passes to git cat-file :./filename , which it
refuses to do since the repo is bare.
Note that, reinject somefile someannexedfile in a bare repo silently does
nothing, because someannexedfile is never actually an annexed worktree
file, because the repo is bare.
No behavior changes, but this shows everywhere that a progress meter
could be displayed when hashing a file to add to the annex.
Many of the places don't make sense to display a progress meter though,
eg when importing the copy of the file probably swamps the hashing of
the file.
The goal is to be able to run CommandStart in the main thread when -J is
used, rather than unncessarily passing it off to a worker thread, which
incurs overhead that is signficant when the CommandStart is going to
quickly decide to stop.
To do that, the message it displays needs to be displayed in the worker
thread, after the CommandStart has run.
Also, the change will mean that CommandStart will no longer necessarily
run with the same Annex state as CommandPerform. While its docs already
said it should avoid modifying Annex state, I audited all the
CommandStart code as part of the conversion. (Note that CommandSeek
already sometimes runs with a different Annex state, and that has not been
a source of any problems, so I am not too worried that this change will
lead to breakage going forward.)
The only modification of Annex state I found was it calling
allowMessages in some Commands that default to noMessages. Dealt with
that by adding a startCustomOutput and a startingUsualMessages.
This lets a command start with noMessages and then select the output it
wants for each CommandStart.
One bit of breakage: onlyActionOn has been removed from commands that used it.
The plan is that, since a StartMessage contains an ActionItem,
when a Key can be extracted from that, the parallel job runner can
run onlyActionOn' automatically. Then commands won't need to worry about
this detail. Future work.
Otherwise, this was a fairly straightforward process of making each
CommandStart compile again. Hopefully other behavior changes were mostly
avoided.
In a few cases, a command had a CommandStart that called a CommandPerform
that then called showStart multiple times. I have collapsed those
down to a single start action. The main command to perhaps suffer from it
is Command.Direct, which used to show a start for each file, and no
longer does.
Another minor behavior change is that some commands used showStart
before, but had an associated file and a Key available, so were changed
to ShowStart with an ActionItemAssociatedFile. That will not change the
normal output or behavior, but --json output will now include the key.
This should not break it for anyone using a real json parser.
This does not change the overall license of the git-annex program, which
was already AGPL due to a number of sources files being AGPL already.
Legally speaking, I'm adding a new license under which these files are
now available; I already released their current contents under the GPL
license. Now they're dual licensed GPL and AGPL. However, I intend
for all my future changes to these files to only be released under the
AGPL license, and I won't be tracking the dual licensing status, so I'm
simply changing the license statement to say it's AGPL.
(In some cases, others wrote parts of the code of a file and released it
under the GPL; but in all cases I have contributed a significant portion
of the code in each file and it's that code that is getting the AGPL
license; the GPL license of other contributors allows combining with
AGPL code.)
This is groundwork for nested seek loops, eg seeking over all files and
then performing commandActions on a list of remotes, which can be done
concurrently.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
Leveraged the existing verification code by making it also check the
retrievalSecurityPolicy.
Also, prevented getViaTmp from running the download action at all when the
retrievalSecurityPolicy is going to prevent verifying and so storing it.
Added annex.security.allow-unverified-downloads. A per-remote version
would be nice to have too, but would need more plumbing, so KISS.
(Bill the Cat reference not too over the top I hope. The point is to
make this something the user reads the documentation for before using.)
A few calls to verifyKeyContent and getViaTmp, that don't
involve downloads from remotes, have RetrievalAllKeysSecure hard-coded.
It was also hard-coded for P2P.Annex and Command.RecvKey,
to match the values of the corresponding remotes.
A few things use retrieveKeyFile/retrieveKeyFileCheap without going
through getViaTmp.
* Command.Fsck when downloading content from a remote to verify it.
That content does not get into the annex, so this is ok.
* Command.AddUrl when using a remote to download an url; this is new
content being added, so this is ok.
This commit was sponsored by Fernando Jimenez on Patreon.
Cryptographically secure hashes can be forced to be used in a repository,
by setting annex.securehashesonly. This does not prevent the git repository
from containing files with insecure hashes, but it does prevent the content
of such files from being pulled into .git/annex/objects from another
repository.
We want to make sure that at no point does git-annex accept content into
.git/annex/objects that is hashed with an insecure key. Here's how it
was done:
* .git/annex/objects/xx/yy/KEY/ is kept frozen, so nothing can be
written to it normally
* So every place that writes content must call, thawContent or modifyContent.
We can audit for these, and be sure we've considered all cases.
* The main functions are moveAnnex, and linkToAnnex; these were made to
check annex.securehashesonly, and are the main security boundary
for annex.securehashesonly.
* Most other calls to modifyContent deal with other files in the KEY
directory (inode cache etc). The other ones that mess with the content
are:
- Annex.Direct.toDirectGen, in which content already in the
annex directory is moved to the direct mode file, so not relevant.
- fix and lock, which don't add new content
- Command.ReKey.linkKey, which manually unlocks it to make a
copy.
* All other calls to thawContent appear safe.
Made moveAnnex return a Bool, so checked all callsites and made them
deal with a failure in appropriate ways.
linkToAnnex simply returns LinkAnnexFailed; all callsites already deal
with it failing in appropriate ways.
This commit was sponsored by Riku Voipio.
This is the same as running git annex reinject --known, followed by
git-annex import. The advantage to having it in one command is that it
only has to hash each file once; the two commands have to
hash the imported files a second time.
This commit was sponsored by Shane-o on Patreon.
ghc 8 added backtraces on uncaught errors. This is great, but git-annex was
using error in many places for a error message targeted at the user, in
some known problem case. A backtrace only confuses such a message, so omit it.
Notably, commands like git annex drop that failed due to eg, numcopies,
used to use error, so had a backtrace.
This commit was sponsored by Ethan Aubin.
In c6632ee5c8, it actually only handled
uploading objects to a shared repository. To avoid verification when
downloading objects from a shared repository, was a lot harder.
On the plus side, if the process of downloading a file from a remote
is able to verify its content on the side, the remote can indicate this
now, and avoid the extra post-download verification.
As of yet, I don't have any remotes (except Git) using this ability.
Some more work would be needed to support it in special remotes.
It would make sense for tahoe to implicitly verify things downloaded from it;
as long as you trust your tahoe server (which typically runs locally),
there's cryptographic integrity. OTOH, despite bup being based on shas,
a bup repo under an attacker's control could have the git ref used for an
object changed, and so a bup repo shouldn't implicitly verify. Indeed,
tahoe seems unique in being trustworthy enough to implicitly verify.
* When annex objects are received into git repositories, their checksums are
verified then too.
* To get the old, faster, behavior of not verifying checksums, set
annex.verify=false, or remote.<name>.annex-verify=false.
* setkey, rekey: These commands also now verify that the provided file
matches the key, unless annex.verify=false.
* reinject: Already verified content; this can now be disabled by
setting annex.verify=false.
recvkey and reinject already did verification, so removed now duplicate
code from them. fsck still does its own verification, which is ok since it
does not use getViaTmp, so verification doesn't happen twice when using fsck
--from.
This is a work in progress. It compiles and is able to do basic command
dispatch, including git autocorrection, while using optparse-applicative
for the core commandline parsing.
* Many commands are temporarily disabled before conversion.
* Options are not wired in yet.
* cmdnorepo actions don't work yet.
Also, removed the [Command] list, which was only used in one place.
Only fsck and reinject and the test suite used the Backend, and they can
look it up as needed from the Key. This simplifies the code and also speeds
it up.
There is a small behavior change here. Before, all commands would warn when
acting on an annexed file with an unknown backend. Now, only fsck and
reinject show that warning.
I've been disliking how the command seek actions were written for some
time, with their inversion of control and ugly workarounds.
The last straw to fix it was sync --content, which didn't fit the
Annex [CommandStart] interface well at all. I have not yet made it take
advantage of the changed interface though.
The crucial change, and probably why I didn't do it this way from the
beginning, is to make each CommandStart action be run with exceptions
caught, and if it fails, increment a failure counter in annex state.
So I finally remove the very first code I wrote for git-annex, which
was before I had exception handling in the Annex monad, and so ran outside
that monad, passing state explicitly as it ran each CommandStart action.
This was a real slog from 1 to 5 am.
Test suite passes.
Memory usage is lower than before, sometimes by a couple of megabytes, and
remains constant, even when running in a large repo, and even when
repeatedly failing and incrementing the error counter. So no accidental
laziness space leaks.
Wall clock speed is identical, even in large repos.
This commit was sponsored by an anonymous bitcoiner.