add compute tip

2025-03-12 13:43:50 -04:00 · 2025-03-12 13:43:50 -04:00 · e505ade963
commit e505ade963
parent a673fc7cfd
2 changed files with 235 additions and 0 deletions
--- a/doc/special_remotes/compute.mdwn
+++ b/doc/special_remotes/compute.mdwn
@@ -26,6 +26,8 @@ program takes a dashed option, it can be provided after "--":
    # git-annex initremote myremote type=compute program=git-annex-compute-foo -- --level=9
 See [[tips/computing_annexed_files]] for more documentation.
 ## compute programs
 To write programs used by the compute special remote, see the 
--- a/doc/tips/computing_annexed_files.mdwn
+++ b/doc/tips/computing_annexed_files.mdwn
@@ -0,0 +1,233 @@
 Do you ever check in original versions of files to `git-annex`, but then
 convert them in some way? Maybe you check in original photos from a camera,
 but then change them to a more useful file format, or smaller resolution. 
 Or you clip a video file. Or you crunch some data to a result.
 If you check the computed file into `git-annex` too, and store it on
 your remotes along with the original, that's a waste of disk space.
 But it is so convenient to be able to `git-annex get` the computed file.
 The [[compute special remote|special_remotes/compute]] is the solution to
 this. It "stores" the computed file by remembering how to compute it from
 input files. When you `git-annex get` the computed file from it, it re-runs
 the computation on the original input file to produce the computed file.
 [[!toc ]]
 ## using the compute special remote
 There are many compute programs that each handle some type of computation,
 and it's pretty easy to write your own compute program too. In this tip,
 we'll use [[special_remotes/compute/git-annex-compute-imageconvert]], 
 which uses imagemagick to convert between image formats.
 To follow along, install that program in PATH (and remember to make it
 executable!) and make sure you have
 [imagemagick](https://www.imagemagick.org/) installed.
 First, initialize a compute remote:
    # git-annex initremote imageconvert type=compute program=git-annex-compute-imageconvert
 Now suppose you have a file `foo.jpeg`, and you want to add a computed
 `foo.gif` to the git-annex repository.
    # git-annex addcomputed --to=imageconvert foo.jpeg foo.gif
 (The syntax of the `git-annex addcomputed` command will vary depending on the
 program that a compute remote uses. Some may have multiple input files, or
 multiple output files, or other options to control the computation. See
 the documentation of each compute program for details.)
 Now you have `foo.gif` and can use it as usual, including copying it to
 other remotes. But it's already "stored" in the imageconvert remote,
 as a computation. So to free up space, you can drop it:
    # git-annex drop foo.gif
    drop foo.gif ok
 By the way, you can also add a computed file to the repository 
 without bothering to compute it yet! Just use `--fast`:
    # git-annex addcomputed --fast --to=imageconvert bar.jpeg bar.gif
 Now suppose you're in another clone of this same repository, and you want
 these gifs.
    # git-annex get foo.gif
    get foo.gif (not available)
 	  Maybe enable some of these special remotes (git annex enableremote ...):
 	  	8332f7ad-d54e-435e-803b-138c1cfa7b71 -- imageconvert
 	failed
 With [[special_remotes/compute/git-annex-compute-imageconvert]] and
 imagemagick installed, all you need to do is enable the special remote to
 get the computed files from it:
 	# git-annex enableremote imageconvert
 	# git-annex get foo.gif
 	get foo.gif (from imageconvert...)
 	(getting input foo.jpeg from origin...)
 	ok
 Notice that, when the input file is not present in the repository, getting
 a file from a compute remote will first get the input file.
 That's the basics of using the compute special remote.
 ## recomputation
 What happens if the input file `foo.jpeg` is changed to a new version?
 Will getting `foo.gif` from the compute remote base it on the new version
 too? No. `foo.gif` is stuck on the original version of the input file that
 was used to compute it.
 But, it's easy to recompute the file with a new version of the input file.
 Just `git-annex add` the new version of the input file, and then:
    # git-annex recompute foo.gif
    recompute foo.gif (foo.jpeg changed)
    ok
 You can use commands like `git diff` and `git status` to see the
 change that was made to `foo.gif`.
    # git status --short foo.gif
     M foo.gif
 Now both the new and old versions of `foo.gif` are stored in the
 imageconvert remote, and it can compute either as needed.
 ## reproducibility
 You might be wondering, what happens if a computed file, such as `foo.gif`,
 isn't exactly the same file each time it's computed? For example,
 what if there's a timestamp in there?
 The answer is that, by default, files computed by a compute special remote
 are not required or guaranteed to be bit-for-bit reproducible. One gif
 converted from a jpeg is much like any other converted from the same jpeg.
 So git-annex usually treats all files computed in the same way from the
 same input as interchangeable. (Unless the compute program indicates
 that it produces reproducible files.)
 Sometimes though, it's important that a file be bit-for-bit reproducible. And
 you can ask git-annex to enforce this for computed files.
 There is a `--reproducible` option for this, which you can pass to
 `git-annex addcomputed` or to `git-annex recompute`.
 Let's switch the computed `foo.gif` to a reproducible file:
    # git-annex recompute --original --reproducible foo.gif
    recompute foo.gif
    ok
 You can `git commit foo.gif` to store this change.
 But first, let's check if that computation actually *is* reproducible.
 This is easy: just drop it and get it from the compute remote again:
    # git-annex drop foo.gif
    drop foo.gif ok
    # git-annex get foo.gif --from imageconvert
    get foo.gif (from imageconvert...)
    ok
 If it turned out that the computation was not reproducible, getting the
 file would fail, like this:
    # git-annex get foo.gif --from imageconvert
    get foo.gif (from imageconvert...)
    Verification of content failed
 This is because a reproducible file uses a regular [[backend]], which
 by default uses a hash to verify the content of the file.
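The effect of that hash check can be sketched outside git-annex. This is only an illustration, not what git-annex runs internally: `same_content` and `fake_compute` are hypothetical helpers, and `sha256sum` from coreutils is assumed to be installed. The second parameter of `fake_compute` stands in for anything that can vary between runs, such as an embedded timestamp.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only: succeed when two files
# have identical content, judged by their SHA-256 checksums.
same_content () {
	a=$(sha256sum < "$1")
	b=$(sha256sum < "$2")
	[ "$a" = "$b" ]
}

# Simulate a computation. The second parameter stands in for anything
# that can differ between runs, such as a timestamp in the output.
fake_compute () {
	printf 'converted data\nstamp=%s\n' "$2" > "$1"
}

tmp=$(mktemp -d)
fake_compute "$tmp/run1" fixed
fake_compute "$tmp/run2" fixed
fake_compute "$tmp/run3" later

# run2 matches run1 bit for bit; run3 does not, so a hash check fails.
same_content "$tmp/run1" "$tmp/run2" && echo "run2 reproducible"
same_content "$tmp/run1" "$tmp/run3" || echo "run3 differs"
```

A file whose recomputation lands in the `run3` case is the kind that would fail verification once it has been marked reproducible.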
 If it does turn out that a file that was expected to be reproducible isn't,
 you can always convert it to an unreproducible file:
    # git-annex recompute --original --unreproducible foo.gif
    recompute foo.gif
    ok
 ## writing your own compute programs
 There is a whole little protocol that compute programs use to
 communicate with git-annex. It's all documented at
 [[design/compute_special_remote_interface]].
 But it's really easy to write simple ones, and you don't need to
 dive into all the details to do it. Let's walk through the code
 to [[special_remotes/compute/git-annex-compute-imageconvert]],
 which, at 14 lines, is about as simple as one can be.
    #!/bin/sh
 It's a shell script.
    set -e
 This makes the script exit nonzero if it fails to read from standard
 input, or if a command fails.
 	if [ -z "$1" ] || [ -z "$2" ]; then
 		echo "Specify the input image file, followed by the output image file." >&2
 		echo "Example: foo.jpg foo.gif" >&2
 		exit 1
 	fi
 It expects to be passed two parameters, which were "foo.jpeg" and "foo.gif" in
 the examples above. Otherwise, it prints a usage message to stderr, which is
 displayed if the user runs `git-annex addcomputed` without the necessary
 filenames.
 	echo "INPUT $1"
 	read input
 It tells git-annex that the first filename is the input file. And git-annex
 replies by telling it *where* the content of the input file is. This is the
 path to a git-annex object file.
 	echo "OUTPUT $2"
 	read output
 It tells git-annex that the second filename is the output file. And git-annex
 replies by telling it the path it should write the output file to.
 	if [ -n "$input" ]; then
 When `git-annex addcomputed --fast` is used, the program shouldn't actually
 read the input file or compute the output file. git-annex indicates this by
 not giving it a path to the input file. That's checked here.
 		convert "$input" "$output" >&2
 This uses `convert` from imagemagick, and just converts the input file to
 the output file.
 Notice that stdout from `convert` is redirected to stderr. This is done
 because the compute program is speaking this protocol with git-annex over
 stdin and stdout, and we don't want random program output to mess that up.
 	fi
 Closing the `if` above.
 And that's all!
 Now you know almost enough to write your own compute program. Editing this one
 will be a good start.
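One way to try out an edited program before wiring it into git-annex is to play git-annex's side of the exchange by hand. The sketch below rests on some assumptions: `git-annex-compute-upcase` is a hypothetical program modelled on the script above (it upper-cases a text file with `tr` instead of calling `convert`), only the `INPUT` and `OUTPUT` messages walked through above are exercised, and the `--fast` case is simulated by replying to `INPUT` with an empty line.

```shell
#!/bin/sh
# Sketch: drive a compute program by hand, standing in for git-annex.
tmp=$(mktemp -d)

# Hypothetical compute program with the same structure as imageconvert.
cat > "$tmp/git-annex-compute-upcase" <<'EOF'
#!/bin/sh
set -e
if [ -z "$1" ] || [ -z "$2" ]; then
	echo "Specify the input file, followed by the output file." >&2
	exit 1
fi
echo "INPUT $1"
read input
echo "OUTPUT $2"
read output
if [ -n "$input" ]; then
	tr a-z A-Z < "$input" > "$output"
fi
EOF

printf 'hello\n' > "$tmp/in.txt"

# Reply to INPUT with the input path and to OUTPUT with the output path,
# capturing the protocol messages the program writes to stdout.
messages=$(printf '%s\n%s\n' "$tmp/in.txt" "$tmp/out.txt" |
	sh "$tmp/git-annex-compute-upcase" in.txt out.txt)
echo "$messages"
cat "$tmp/out.txt"

# Simulated --fast run: an empty reply to INPUT, so nothing is computed.
printf '\n%s\n' "$tmp/fast.txt" |
	sh "$tmp/git-annex-compute-upcase" in.txt fast.txt > /dev/null
[ ! -e "$tmp/fast.txt" ] && echo "fast run wrote no output file"
```

The captured messages contain only the two protocol lines, because the computation writes to the output file and never to stdout.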
 **But first, a word about security.**
 A user who enables a compute special remote and runs `git pull` followed by
 `git-annex get` is running the compute program with inputs under the control
 of anyone who has commit access to the repository.
 So, it's important that your compute program be secure. Please see
 the section on security in [[design/compute_special_remote_interface]]
 for security considerations.
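As an illustration of the kind of check this can involve (not a summary of that section), a compute program might refuse parameters that are unsafe to use as filenames. The `safe_param` helper below is hypothetical. It rejects an empty parameter, one beginning with a dash (which a command could mistake for an option), and one containing a newline (which would inject an extra line into a message like `INPUT $1`). It only suits parameters that are meant to be filenames, not options the program deliberately accepts.

```shell
#!/bin/sh
# Hypothetical helper: accept only parameters that are safe to embed in
# a one-line protocol message and to hand to a command as a filename.
nl='
'
safe_param () {
	case "$1" in
		"" | -* | *"$nl"*) return 1 ;;
	esac
	return 0
}

for p in foo.jpeg --level=9 ""; do
	if safe_param "$p"; then
		printf 'ok: %s\n' "$p"
	else
		printf 'rejected: %s\n' "$p"
	fi
done
```

A program using such a check would print a message to stderr and exit nonzero when a parameter is rejected, the same way the usage check in the script above does.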
 If you write a nice secure compute program, you can add it to the list
 in [[special_remotes/compute]] so other people can use it.