git-annex/doc/design/compute_special_remote_interface.mdwn
Joey Hess f32d2aecce
autoenable security for compute special remote
Added annex.security.autoenable-compute-programs and only allow
autoenabling special remotes that use compute programs on that list.

The reason this is needed is a user might have some compute programs
that are less safe to use than others. They might want to use an unsafe
one only with one repository, where they are the only committer or other
committers are trusted. They might be ok with others being used by any
repository, and if so they can add them to the list.

Another reason would be a user who has installed a compute program by
accident. Eg, it might be included with git-annex at some point, or
pulled in by some dependency. That user doesn't necessarily want that
compute program to be used in an autoenabled special remote.
2025-03-03 15:52:56 -04:00


**draft**

The [[special_remotes/compute]] special remote uses this interface to run
compute programs.

When a compute special remote is initremoted, a program is specified:

    git-annex initremote myremote type=compute program=git-annex-compute-foo

The user adds an annexed file that is computed by the program by running
a command like one of these:

    git-annex addcomputed --to=myremote -- convert file.raw file.jpeg passes=10
    git-annex addcomputed --to=myremote -- compress in out --level=9
    git-annex addcomputed --to=myremote -- clip foo 2:01-3:00 combine with bar to baz

Whatever values the user passes to `git-annex addcomputed` are passed to
the program in `ARGV`, followed by any values that the user provided to
`git-annex initremote`.

To simplify the program's option parsing, any value that the user provides
that is in the form "foo=bar" will also result in an environment variable
being set, eg `ANNEX_COMPUTE_passes=10` or `ANNEX_COMPUTE_--level=9`.
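
As a minimal sketch of how a program might consume such a variable, the
helper below reads `ANNEX_COMPUTE_passes` and falls back to a default. The
function name `get_passes`, the default of 1, and the numeric check are
assumptions made for illustration; they are not part of the interface.

```shell
#!/bin/sh
# Hypothetical helper: read the "passes" option that git-annex exports as
# ANNEX_COMPUTE_passes, falling back to 1 when it is unset or empty.
get_passes() {
    passes="${ANNEX_COMPUTE_passes:-1}"
    # Reject anything that is not a plain non-negative integer.
    case "$passes" in
        *[!0-9]*)
            echo "passes must be a number" >&2
            return 1
            ;;
    esac
    echo "$passes"
}
```

A program could then call `passes=$(get_passes) || exit 1` before starting
its computation.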

For security, the program should never pass user input to the shell
unquoted, or execute it in any other way.
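
One way to follow this advice, sketched under the assumption that the
program is a shell script, is to accept only known subcommands and to
always quote expansions of user values. The function names and the list of
subcommands here are hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch: treat user-supplied values strictly as data.
# Unsafe, shown only as a comment: eval "frobnicate $1"

# Accept only known subcommands instead of running whatever was passed.
check_subcommand() {
    case "$1" in
        convert|compress) return 0 ;;
        *) echo "unknown subcommand: $1" >&2; return 1 ;;
    esac
}

# Quoting the expansion passes the value through as a single argument.
# printf stands in here for a real tool.
emit_arg() {
    printf '%s\n' "$1"
}
```

With the quoting in place, a value such as `a; echo pwned` reaches the tool
as one literal argument and is never interpreted by the shell.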

The program is run in a temporary directory, which will be cleaned up after
it exits. Note that it may be run in a subdirectory of a temporary
directory. This is done when `git-annex addcomputed` was run in a subdirectory
of the git repository.

The content of any file in the repository can be an input to the
computation. The program requests an input by writing a line to stdout:

    INPUT file.raw

Then it can read a line from stdin, which will be the path to the content
(eg a `.git/annex/objects/` path).

If the program needs multiple input files, it should output multiple
`INPUT` lines first, and then read multiple paths from stdin. This
allows retrieval of the inputs to potentially run in parallel.

If an input file is not available, the program's stdin will be closed
without a path being written to it. So when reading from stdin fails,
the program should exit.
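
The exchange described above might look like this for a program with two
inputs. This is only a sketch: the function name `request_two_inputs` and
the variables `first_path` and `second_path` are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sketch: request two inputs up front, then read both paths.
request_two_inputs() {
    # Announce every input before reading any response, so that
    # retrieval of the inputs can potentially run in parallel.
    printf 'INPUT %s\n' "$1"
    printf 'INPUT %s\n' "$2"
    # One path arrives per INPUT line. A failed read means stdin was
    # closed because an input is not available.
    IFS= read -r first_path || return 1
    IFS= read -r second_path || return 1
}
```

A caller would run `request_two_inputs foo bar || exit 1` and then use
`$first_path` and `$second_path` as the files to read from.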

When `git-annex addcomputed --fast` is being used to add a computation
to the git-annex repository without actually performing it, the
response to each "INPUT" will be an empty line rather than the path to
an input file. In that case, the program should proceed with the rest of
its output to stdout (eg "OUTPUT" and "REPRODUCIBLE"), but should not
perform any computation.

For each output file that it will compute, the program should write a
line to stdout:

    OUTPUT file.jpeg

The filename of the output file is both the filename in the program's
temporary directory, and also the filename that will be added to the
git-annex repository by `git-annex addcomputed`.

If git-annex sees that an output file is growing, it will use its file size
when displaying progress to the user. So if possible, the program should
write the content directly to the file it is computing, rather than writing
to somewhere else and renaming it at the end. However, if the program seeks
around and writes out of order, it should write to a file somewhere else
and rename it at the end.
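
The two strategies can be sketched as follows. Here `tr` stands in for a
real computation, and the function names and the `.partial` scratch
filename are assumptions made for illustration.

```shell
#!/bin/sh
# Sequential writer: stream straight into the declared output file, so
# its growing size can be used for progress display.
compute_streaming() {
    tr 'a-z' 'A-Z' <"$1" >"$2"
}

# Out-of-order writer: build the result in a scratch file first, then
# rename it into place once it is complete.
compute_then_rename() {
    scratch="$2.partial"    # hypothetical scratch filename
    tr 'a-z' 'A-Z' <"$1" >"$scratch" && mv "$scratch" "$2"
}
```

In both functions the first argument is the input path and the second is
the output filename that was announced with `OUTPUT`.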

The program can also output lines to stdout to indicate its current
progress:

    PROGRESS 50%
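
A program that works through a known number of steps could report progress
along these lines. This is a sketch only; the function name
`process_chunks` and the idea of equal-sized chunks are assumptions.

```shell
#!/bin/sh
# Hypothetical sketch: report progress after each of a fixed number of
# work steps, using the PROGRESS line format shown above.
process_chunks() {
    total=$1
    i=1
    while [ "$i" -le "$total" ]; do
        # The real work for chunk $i would go here.
        printf 'PROGRESS %d%%\n' "$((i * 100 / total))"
        i=$((i + 1))
    done
}
```

Calling `process_chunks 4` writes four lines, from `PROGRESS 25%` up to
`PROGRESS 100%`.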

The program can optionally also output a "REPRODUCIBLE" line. That
indicates that the results of its computations are expected to be
bit-for-bit reproducible. That makes `git-annex addcomputed` behave as if
the `--reproducible` option is set.

Anything that the program outputs to stderr will be displayed to the user.
Stderr should be used for error messages, and possibly computation
output, but not for progress displays.

If the program exits nonzero, nothing it computed will be stored in the
git-annex repository.
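
An example `git-annex-compute-foo` shell script follows:

    #!/bin/sh
    set -e
    # Only the "convert" subcommand is supported, and it needs both filenames.
    if [ "$1" != "convert" ] || [ -z "$2" ] || [ -z "$3" ]; then
        echo "Usage: convert input output [passes=n]" >&2
        exit 1
    fi
    # Default to a single pass when passes=n was not given.
    if [ -z "$ANNEX_COMPUTE_passes" ]; then
        ANNEX_COMPUTE_passes=1
    fi
    # Request the input. If it is not available, stdin is closed, the read
    # fails, and set -e makes the script exit.
    echo "INPUT $2"
    IFS= read -r input
    echo "OUTPUT $3"
    echo REPRODUCIBLE
    # An empty response means --fast is in use, so skip the computation.
    if [ -n "$input" ]; then
        mkdir -p "$(dirname "$3")"
        frobnicate --passes="$ANNEX_COMPUTE_passes" <"$input" >"$3"
    fi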