This needed some refactoring to avoid cycles, since Remote.Compute
cannot import Remote.List. Instead, it uses Annex.remotes. Which must be
populated by something else, but we know it has been, because something
is using Remote.Compute, which it must have found in the remote list,
which populates that.
In Remote.Compute, keyPossibilities' is called with all loggedLocations,
without the trustExclude DeadTrusted that keyLocations does. There is
another cycle there. This may be a problem if a dead repository is still
a remote.
This is missing cycle prevention, and it's certianly possible to make 2
files in the compute remote co-depend on one-another. Hopefully not in a
real world situation, but it an attacker could certainly do it. Cycle
prevention will need to be added to this.
And require for enable as well as autoenable.
It seemed asking for trouble for `git-annex enable foo` to use whatever
compute program is stored in the git config, without verifying that the
user wants that program to be used.
Note that it would be good to allow `git-annex enable foo program=...`
to be used without the program being in the git config. Not implemented yet
though.
Added annex.security.autoenable-compute-programs and only allow
autoenabling special remotes that use compute programs on that list.
The reason this is needed is a user might have some compute programs
that are less safe to use than others. They might want to use an unsafe
one only with one repository, where they are the only committer or other
committers are trusted. They might be ok with others being used by any
repository, and if so they can add them to the list.
Another reason would be a user who has installed a compute program by
accident. Eg, it might be included with git-annex at some point, or
pulled in by some dependency. That user doesn't necessarily want that
compute program to be used in an autoenabled special remote.
Using GIT keys, like are used when exporting git files to special
remotes. Except here the GIT key refers to a file checked into the git
repo.
Note that, since the compute remote uses catObject to get the content,
a symlink that is checked into git does not get followed. This is important
for security, because following a symlink and adding the content to the
repo as an annex object would allow exfiltrating content from outside
the repository.
Instead, the behavior with a symlink is to run the computation on the
symlink target. This may turn out to be confusing, and it might be worth
addcomputed checking if the file in git is a symlink and erroring out.
Or it could follow symlinks as long as the destination is a file in the
repisitory.
Like when getting from the web special remote, when the output of the
computation has changed, record the new hash of the content as an
equivilant key for the VURL key.
Still needs to be done for addcomputed and recompute.
I've lost track of them all, but it includes:
* Using the same key backend as was used in the original computation.
* Fixing bug that prevented updating the source file key in the compute
state
* Handling --reproducible and --unreproducible.
* recompute --original of a file using VURL, when the result is
different, but the key remains the same, makes the object file
be updated with the new content
* Detecting some other ways the program behavior can change, just for
completeness.
* Also adds --backend to addcomputed.
The perform action of this still needs work to do the right thing.
In particular, it currently behaves as if --others was always set.
And, it duplicates a lot of code from addcomputed.
This is limited because the remote config is a field/value map. So order
is not preserved, and when 2 parameters have the same field name, only
the last one will be passed.
For these, use VURL and URL keys, with an "annex-compute:" URI prefix.
These URL keys will look something like this:
URL--annex-compute&cbar4,63pconvert,3-f4d3d72cf3f16ac9c3e9a8012bde4462
Generally it's too long so most of it gets md5summed. It's a little
ugly, but it's what fell out of the existing URL key generation
machinery. I did consider special casing to eg
"URL--annex-compute&c4d3d72cf3f16ac9c3e9a8012bde4462". But it seems at
least possibly useful that the name of the file that was computed is
visible and perhaps one or two words of the git-annex compute command
parameters.
Note that two different output files from the same computation will get
the same URL key. And these keys should remain stable.
Working pretty well. Mostly. But:
* Does not yet support inputs that are non-annexed files checked into git
* --fast is currently broken (will need something like VURL keys)
* --unreproducible still uses a checksumming backend, so drop and get
again will likely fail (needs probably to use an URL key or something
like one)
The compute special remote seems to work pretty well too. Eg,
getting from it works, and dropping content that is present in it works.
Eg, a computation might be run in "foo/" and refer to "../bar" as an
input or output.
So, the subdir is part of the computation state.
Also, prevent input or output of files that are outside the git
repository. Of course, the program can access any file on disk if it
wants to; this is just a guard against mistakes. And it may also be
useful if the program comunicates with something less trusted than it,
eg a container image, so input/output files communicated by that are not
the source of security problems.
Except for some of the hard parts: progress displays, incremental
verification, and getting inputs before running a computation.
Untested! In order to test this, git-annex addcomputed needs to be
implemented.