git-annex/doc/design/compute_special_remote_interface.mdwn
2025-02-25 17:26:28 -04:00

108 lines
4.1 KiB
Markdown

**draft**
The [[special_remotes/compute]] special remote uses this interface to run
compute programs.
When an compute special remote is initremoted, a program is specified:
git-annex initremote myremote type=compute program=git-annex-compute-foo
The user adds an annexed file that is computed by the program by running
a command like one of these:
git-annex addcomputed --to=myremote -- convert file.raw file.jpeg passes=10
git-annex addcomputed --to=myremote -- compress in out --level=9
git-annex addcomputed --to=myremote -- clip foo 2:01-3:00 combine with bar to baz
Whatever values the user passes to `git-annex addcomputed` are passed to
the program in `ARGV`, followed by any values that the user provided to
`git-annex initremote`.
To simplify the program's option parsing, any value that the user provides
that is in the form "foo=bar" will also result in an environment variable
being set, eg `ANNEX_COMPUTE_passes=10` or `ANNEX_COMPUTE_--level=9`.
For security, the program should avoid exposing user input to the shell
unprotected, or otherwise executing it.
The program is run in a temporary directory, which will be cleaned up after
it exits. Note that it may be run in a subdirectory of its temporary
directory. This is done when `git-annex addcomputed` was run in a subdirectory
of the git repository.
The content of any annexed file in the repository can be an input
to the computation. The program requests an input by writing a line to
stdout:
INPUT file.raw
Then it can read a line from stdin, which will be the path to the content
(eg a `.git/annex/objects/` path).
If the program needs multiple input files, it should output multiple
`INPUT` lines first, and then read multiple paths from stdin. This
allows retrieval of the inputs to potentially run in parallel.
If an input file is not available, the program's stdin will be closed
without a path being written to it. So when reading from stdin fails,
the program should exit.
When `git-annex addcomputed --fast` is being used to add a computation
to the git-annex repository without actually performing it, the
response to each "INPUT" will be an empty line rather than the path to
an input file. In that case, the program should proceed with the rest of
its output to stdout (eg "OUTPUT" and "REPRODUCIBLE"), but should not
perform any computation.
For each output file that it will compute, the program should write a
line to stdout:
OUTPUT file.jpeg
The filename of the output file is both the filename in the program's
temporary directory, and also the filename that will be added to the
git-annex repository by `git-annex compute`.
If git-annex sees that an output file is growing, it will use its file size
when displaying progress to the user. So if possible, the program should
write the content to the file it is computing directly, rather than writing
to somewhere else and renaming it at the end. But, if the program seeks
around and writes out of order, it should write to a file somewhere else
and rename it at the end.
The program can also output lines to stdout to indicate its current
progress:
PROGRESS 50%
The program can optionally also output a "REPRODUCIBLE" line. That
indicates that the results of its computations are expected to be
bit-for-bit reproducible. That makes `git-annex addcomputed` behave as if
the `--reproducible` option is set.
Anything that the program outputs to stderr will be displayed to the user.
This stderr should be used for error messages, and possibly computation
output, but not for progress displays.
If the program exits nonzero, nothing it computed will be stored in the
git-annex repository.
An example `git-annex-compute-foo` shell script follows:
#!/bin/sh
set -e
if [ "$1" != "convert" ]; then
echo "Usage: convert input output [passes=n]" >&2
exit 1
fi
if [ -z "$ANNEX_COMPUTE_passes" ]; then
ANNEX_COMPUTE_passes=1
fi
echo "INPUT $2"
read input
echo "OUTPUT $3"
echo REPRODUCIBLE
if [ -n "$input" ]; then
mkdir -p "$(dirname "$3")"
frobnicate --passes="$ANNEX_COMPUTE_passes" <"$input" >"$3"
fi