git-annex/doc/design/compute_special_remote_interface.mdwn
Joey Hess a2fc471e14
safer git sha object filename
Rather than use the filename provided by INPUT, which could come from user
input, and so could be something that looks like a dashed parameter,
use a .git/object/<sha> filename.

This avoids user input passing through INPUT and back out, with the file
path then passed to a command, which could do something unexpected with
a dashed parameter, or other special parameter.

Added a note in the design about being careful of passing user input to
commands. They still have to be careful of that in general, just not in
this case.
2025-03-04 14:54:13 -04:00

109 lines
4.2 KiB
Markdown

**draft**
The [[special_remotes/compute]] special remote uses this interface to run
compute programs.
When an compute special remote is initremoted, a program is specified:
git-annex initremote myremote type=compute program=git-annex-compute-foo
The user adds an annexed file that is computed by the program by running
a command like one of these:
git-annex addcomputed --to=myremote -- convert file.raw file.jpeg passes=10
git-annex addcomputed --to=myremote -- compress in out --level=9
git-annex addcomputed --to=myremote -- clip foo 2:01-3:00 combine with bar to baz
Whatever values the user passes to `git-annex addcomputed` are passed to
the program in `ARGV`, followed by any values that the user provided to
`git-annex initremote`.
To simplify the program's option parsing, any value that the user provides
that is in the form "foo=bar" will also result in an environment variable
being set, eg `ANNEX_COMPUTE_passes=10` or `ANNEX_COMPUTE_--level=9`.
For security, the program should avoid exposing user input to the shell
unprotected, or otherwise executing it. And when running a command, make
sure that whatever user input is passed to it can result in only safe and
expected behavior.
The program is run in a temporary directory, which will be cleaned up after
it exits. Note that it may be run in a subdirectory of a temporary
directory. This is done when `git-annex addcomputed` was run in a subdirectory
of the git repository.
The content of any file in the repository can be an input to the
computation. The program requests an input by writing a line to stdout:
INPUT file.raw
Then it can read a line from stdin, which will be the path to the content
(eg a `.git/annex/objects/` path).
If the program needs multiple input files, it should output multiple
`INPUT` lines first, and then read multiple paths from stdin. This
allows retrieval of the inputs to potentially run in parallel.
If an input file is not available, the program's stdin will be closed
without a path being written to it. So when reading from stdin fails,
the program should exit.
When `git-annex addcomputed --fast` is being used to add a computation
to the git-annex repository without actually performing it, the
response to each "INPUT" will be an empty line rather than the path to
an input file. In that case, the program should proceed with the rest of
its output to stdout (eg "OUTPUT" and "REPRODUCIBLE"), but should not
perform any computation.
For each output file that it will compute, the program should write a
line to stdout:
OUTPUT file.jpeg
The filename of the output file is both the filename in the program's
temporary directory, and also the filename that will be added to the
git-annex repository by `git-annex compute`.
If git-annex sees that an output file is growing, it will use its file size
when displaying progress to the user. So if possible, the program should
write the content to the file it is computing directly, rather than writing
to somewhere else and renaming it at the end. But, if the program seeks
around and writes out of order, it should write to a file somewhere else
and rename it at the end.
The program can also output lines to stdout to indicate its current
progress:
PROGRESS 50%
The program can optionally also output a "REPRODUCIBLE" line. That
indicates that the results of its computations are expected to be
bit-for-bit reproducible. That makes `git-annex addcomputed` behave as if
the `--reproducible` option is set.
Anything that the program outputs to stderr will be displayed to the user.
This stderr should be used for error messages, and possibly computation
output, but not for progress displays.
If the program exits nonzero, nothing it computed will be stored in the
git-annex repository.
An example `git-annex-compute-foo` shell script follows:
#!/bin/sh
set -e
if [ "$1" != "convert" ]; then
echo "Usage: convert input output [passes=n]" >&2
exit 1
fi
if [ -z "$ANNEX_COMPUTE_passes" ]; then
ANNEX_COMPUTE_passes=1
fi
echo "INPUT $2"
read input
echo "OUTPUT $3"
echo REPRODUCIBLE
if [ -n "$input" ]; then
mkdir -p "$(dirname "$3")"
frobnicate --passes="$ANNEX_COMPUTE_passes" <"$input" >"$3"
fi