git-annex/doc/design/compute_special_remote_interface.mdwn
2025-03-11 12:54:34 -04:00

168 lines
6.6 KiB
Markdown

The [[special_remotes/compute]] special remote uses this interface to run
compute programs.
When an compute special remote is initremoted, a program is specified:
git-annex initremote myremote type=compute program=git-annex-compute-foo
The user adds an annexed file that is computed by the program by running
a command like one of these:
git-annex addcomputed --to=myremote -- convert file.raw file.jpeg passes=10
git-annex addcomputed --to=myremote -- compress in out --level=9
git-annex addcomputed --to=myremote -- clip foo 2:01-3:00 combine with bar to baz
## security
Security is very important here, because a user who enables a compute
special remote and runs `git pull` followed by `git-annex get` is running
the compute program with inputs under the control of anyone who has
commit access to the repository.
The contents of input files should be assumed to be untrusted, and so
should the filenames of input and output files, as well as everything
else passed to the program in `ARGV` and the environment.
The program should make sure that whatever user input is passed
to it can result in only safe and expected behavior. The program should
avoid exposing user input to the shell unprotected, or otherwise executing
it. (Except when the program is explicitly running user input in some form
of sandbox.)
## program parameters and environment
Whatever values the user passes to `git-annex addcomputed` are passed to
the program in `ARGV`, followed by any values that the user provided to
`git-annex initremote`.
To simplify the program's option parsing, any value that the user provides
that is in the form "foo=bar" will also result in an environment variable
being set, eg `ANNEX_COMPUTE_passes=10` or `ANNEX_COMPUTE_--level=9`.
The program is run in a temporary directory, which will be cleaned up after
it exits. It may be run in a subdirectory of the temporary directory. This
is done when `git-annex addcomputed` was run in a subdirectory of the git
repository.
Anything that the program outputs to stderr will be displayed to the user.
This stderr should be used for error messages, and possibly computation
output, but not for progress displays.
If the program exits nonzero, nothing it computed will be stored in the
git-annex repository.
## input files
Before doing any computation, the program needs to communicate with
git-annex about what input files it needs, and what output files it will
generate.
The content of any file in the repository can be an input to the
computation. The program requests an input by writing a line to stdout:
INPUT file.raw
Then it can read a line from stdin, which will be the path to the content
(eg a `.git/annex/objects/` path).
If the program needs multiple input files, it should output multiple
`INPUT` lines first, and then read multiple paths from stdin. This
allows retrieval of the inputs to potentially run in parallel.
If an input file is not available, the program's stdin will be closed
without a path being written to it. So when reading from stdin fails,
the program should exit.
When `git-annex addcomputed --fast` is being used to add a computation to
the git-annex repository without actually performing it, the response to
each `INPUT` will be an empty line rather than the path to an input file.
This can also happen when an input file is not available for whatever
reason. In this case, the program should proceed with the rest of its
output to stdout (eg `OUTPUT` and `REPRODUCIBLE`), but should not perform
any computation.
## output files
For each output file that it will compute, the program should write a
line to stdout, indicating the name of the file that will be added to the
git-annex repository by `git-annex compute`.
OUTPUT file.jpeg
Then it should read a line from stdin, which is the path, in the program's
temporary directory, where it should write the output file. Often this will
be the same filename, but it also may be a sanitized version. It's
important to use that sanitized version to avoid path traversal attacks, as
well as problems like filenames that look like dashed options.
If there is a path traversal attack, the program's stdin will be closed
without a path being written to it.
The program must write a regular file to the output file. Symlinks
or other special files will not be accepted as output files.
If git-annex sees that an output file is growing, it will use its file size
when displaying progress to the user. So if possible, the program should
write the content to the file it is computing directly, rather than writing
to somewhere else and renaming it at the end. But, if the program seeks
around and writes out of order, it should write to a file somewhere else
and rename it at the end.
## other messages
As well as `INPUT` and `OUTPUT` described above, there are some other
messages that the program can output. All of these are optional.
* `PROGRESS 50%`
To indicate its current progress while performing the computation,
the program can output lines like this. This is not needed if the program
streams output to an output file.
* `REPRODUCIBLE`
This indicates that the results of the computation are expected to be
bit-for-bit reproducible. That makes `git-annex addcomputed` behave as if
the `--reproducible` option is set.
* `SANDBOX`
After outputting this line, the program can read a line from stdin
that will be the path to the directory it should sandbox to (which
corresponds to the top of the git repository, so may be above its working
directory). Any `INPUT` lines that come after `SANDBOX` will have input
files be provided via paths that are inside the sandbox directory. Usually
that is done by making hard links, but it will fall back to copying annexed
files if the filesystem does not support hard links.
* `INPUT-REQUIRED`
This works the same as `INPUT`, except when `git-annex addcomputed --fast`
is being used to add a computation to the git-annex repository without
actually performing it, the input file will be provided as a response
to this, rather than the empty line provided as a response to `INPUT`.
If the input file is not available for some reason, an empty line will
still be provided as a response to this.
## example
An example `git-annex-compute-foo` shell script follows:
#!/bin/sh
set -e
if [ "$1" != "convert" ]; then
echo "Usage: convert input output [passes=n]" >&2
exit 1
fi
if [ -z "$ANNEX_COMPUTE_passes" ]; then
ANNEX_COMPUTE_passes=1
fi
echo "INPUT $2"
read input
echo "OUTPUT $3"
read output
echo REPRODUCIBLE
if [ -n "$input" ]; then
frobnicate --passes="$ANNEX_COMPUTE_passes" -i "$input" -o "$output" >&2
fi