
Rather than using the filename provided by INPUT, which could come from user input and so could look like a dashed parameter, use a .git/object/<sha> filename. This avoids user input passing through INPUT and back out as a file path that is then passed to a command, which could do something unexpected with a dashed parameter or other special parameter. Added a note in the design about being careful when passing user input to commands. Programs still have to be careful about that in general, just not in this case.
**draft**

The [[special_remotes/compute]] special remote uses this interface to run
compute programs.

When a compute special remote is initremoted, a program is specified:

    git-annex initremote myremote type=compute program=git-annex-compute-foo

The user adds an annexed file that is computed by the program by running
a command like one of these:

    git-annex addcomputed --to=myremote -- convert file.raw file.jpeg passes=10
    git-annex addcomputed --to=myremote -- compress in out --level=9
    git-annex addcomputed --to=myremote -- clip foo 2:01-3:00 combine with bar to baz

Whatever values the user passes to `git-annex addcomputed` are passed to
the program in `ARGV`, followed by any values that the user provided to
`git-annex initremote`.

To simplify the program's option parsing, any value that the user provides
that is in the form "foo=bar" will also result in an environment variable
being set, eg `ANNEX_COMPUTE_passes=10` or `ANNEX_COMPUTE_--level=9`.

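As a rough illustration of the two ways a setting reaches the program, the
sketch below reads `passes` from the environment with a default, and scans
`ARGV` for a `NAME=VALUE` argument. The helper names `get_passes` and
`arg_value` are hypothetical and not part of the interface. The `ARGV` scan
is included on the assumption that a shell script may prefer it for names
such as `--level`, which are not valid shell variable names and so cannot be
referenced with `$`.

```shell
#!/bin/sh
# Hypothetical helpers; not part of the interface.

# Print the value of passes=N from the environment, defaulting to 1
# when it is unset or empty.
get_passes() {
    printf '%s\n' "${ANNEX_COMPUTE_passes:-1}"
}

# Print the value of the first argument of the form NAME=VALUE,
# or DEFAULT when there is no such argument.
# usage: arg_value NAME DEFAULT "$@"
arg_value() {
    name=$1 default=$2
    shift 2
    for arg in "$@"; do
        case $arg in
            "$name"=*)
                # Strip the leading "NAME=" and print the rest.
                printf '%s\n' "${arg#"$name"=}"
                return 0
                ;;
        esac
    done
    printf '%s\n' "$default"
}
```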
For security, the program should avoid exposing user input to the shell
unprotected, or otherwise executing it. And when running a command, it should
make sure that whatever user input is passed to the command can result in
only safe and expected behavior.

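Two common precautions can be sketched as follows: validating that a value
is a plain number before using it, and making sure a user-supplied path
cannot be mistaken for an option. The helper names `is_uint` and `safe_path`
are hypothetical, and this is only one possible approach, not a complete
defense.

```shell
#!/bin/sh
# Hypothetical helpers; not part of the interface.

# Succeed only if $1 is a non-empty string of decimal digits.
is_uint() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Print $1 in a form that cannot be parsed as an option: absolute paths
# are printed unchanged, anything else gets a leading "./".
safe_path() {
    case $1 in
        /*) printf '%s\n' "$1" ;;
        *) printf './%s\n' "$1" ;;
    esac
}
```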
The program is run in a temporary directory, which will be cleaned up after
it exits. Note that it may be run in a subdirectory of a temporary
directory. This is done when `git-annex addcomputed` was run in a subdirectory
of the git repository.

The content of any file in the repository can be an input to the
computation. The program requests an input by writing a line to stdout:

    INPUT file.raw

Then it can read a line from stdin, which will be the path to the content
(eg a `.git/annex/objects/` path).

If the program needs multiple input files, it should output multiple
`INPUT` lines first, and then read multiple paths from stdin. This
allows retrieval of the inputs to potentially run in parallel.

If an input file is not available, the program's stdin will be closed
without a path being written to it. So when reading from stdin fails,
the program should exit.

When `git-annex addcomputed --fast` is being used to add a computation
to the git-annex repository without actually performing it, the
response to each "INPUT" will be an empty line rather than the path to
an input file. In that case, the program should proceed with the rest of
its output to stdout (eg "OUTPUT" and "REPRODUCIBLE"), but should not
perform any computation.

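The `INPUT` exchange described above could be wrapped in two small functions,
sketched here under the assumption that all inputs are requested before any
reply is read. The names `request_inputs` and `read_reply` are hypothetical.
A caller would run something like `read_reply first || exit 1` for each
requested input, and treat an empty value as the `--fast` case.

```shell
#!/bin/sh
# Hypothetical helpers; not part of the interface.

# Write one INPUT line per argument, before reading any reply.
request_inputs() {
    for name in "$@"; do
        printf 'INPUT %s\n' "$name"
    done
}

# Read one reply line from stdin into the variable named by $1.
# Returns nonzero when stdin has been closed, which means an input
# was not available. An empty reply means --fast is in use.
read_reply() {
    IFS= read -r "$1"
}
```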
For each output file that it will compute, the program should write a
line to stdout:

    OUTPUT file.jpeg

The filename of the output file is both the filename in the program's
temporary directory, and also the filename that will be added to the
git-annex repository by `git-annex addcomputed`.

If git-annex sees that an output file is growing, it will use its file size
when displaying progress to the user. So if possible, the program should
write the content to the file it is computing directly, rather than writing
to somewhere else and renaming it at the end. But, if the program seeks
around and writes out of order, it should write to a file somewhere else
and rename it at the end.

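For a program that writes out of order, the final rename might look like the
sketch below. The function name `publish_output` and the idea of keeping the
scratch file under a separate name in the temporary directory are
assumptions made for illustration; the interface does not say where the
scratch file should live.

```shell
#!/bin/sh
# Hypothetical helper; not part of the interface.

# Move a finished scratch file into place as the output file,
# creating the output's parent directory if needed.
# usage: publish_output SCRATCH OUTPUT
publish_output() {
    mkdir -p -- "$(dirname -- "$2")"
    mv -- "$1" "$2"
}
```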
The program can also output lines to stdout to indicate its current
progress:

    PROGRESS 50%

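A progress line in the format shown above could be produced from a count of
completed steps, as in this minimal sketch. The function name
`report_progress` is hypothetical, and reporting whole-number percentages is
an assumption based on the `50%` example.

```shell
#!/bin/sh
# Hypothetical helper; not part of the interface.

# Print a PROGRESS line as an integer percentage of DONE out of TOTAL.
# Returns nonzero without printing when TOTAL is not positive.
# usage: report_progress DONE TOTAL
report_progress() {
    [ "$2" -gt 0 ] || return 1
    printf 'PROGRESS %d%%\n' "$(( $1 * 100 / $2 ))"
}
```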
The program can optionally also output a "REPRODUCIBLE" line. That
indicates that the results of its computations are expected to be
bit-for-bit reproducible. That makes `git-annex addcomputed` behave as if
the `--reproducible` option is set.

Anything that the program outputs to stderr will be displayed to the user.
Stderr should be used for error messages, and possibly computation
output, but not for progress displays.

If the program exits nonzero, nothing it computed will be stored in the
git-annex repository.

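Error handling that follows these two rules can be kept in one place, as in
this sketch. The function name `die` is a common shell convention, not
something the interface defines. It keeps the message off stdout, which
carries the protocol lines, and exits nonzero so that nothing is stored.

```shell
#!/bin/sh
# Hypothetical helper; not part of the interface.

# Print an error message to stderr and exit nonzero.
die() {
    printf '%s\n' "$*" >&2
    exit 1
}
```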
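An example `git-annex-compute-foo` shell script follows:

    #!/bin/sh
    set -e
    # Only the "convert" subcommand is supported.
    if [ "$1" != "convert" ]; then
        echo "Usage: convert input output [passes=n]" >&2
        exit 1
    fi
    # passes=n is optional; default to a single pass.
    if [ -z "$ANNEX_COMPUTE_passes" ]; then
        ANNEX_COMPUTE_passes=1
    fi
    # Request the input file, then read the path to its content.
    # If stdin was closed, read fails and set -e makes the script exit.
    printf 'INPUT %s\n' "$2"
    IFS= read -r input
    printf 'OUTPUT %s\n' "$3"
    echo REPRODUCIBLE
    # An empty reply means --fast is in use, so skip the computation.
    if [ -n "$input" ]; then
        # Use -- so an output name starting with a dash is not taken as an option.
        mkdir -p -- "$(dirname -- "$3")"
        frobnicate --passes="$ANNEX_COMPUTE_passes" <"$input" >"$3"
    fi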