From e952753846d9157f59f4c0e00c4aff437a21f85f Mon Sep 17 00:00:00 2001
From: Joey Hess
Date: Thu, 6 Mar 2025 14:22:45 -0400
Subject: [PATCH] preparing to merge compute

---
 CHANGELOG                                     |  4 ++
 ..._2546562f7a00e082cd0500debc904cf3._comment | 22 ++++++
 ..._d1561153a3916411ed8caa92fa53893c._comment | 69 +++++++++++++++++++
 ...ompute_special_remote_remaining_todos.mdwn | 18 ++++-
 4 files changed, 111 insertions(+), 2 deletions(-)
 create mode 100644 doc/todo/compute_special_remote/comment_21_2546562f7a00e082cd0500debc904cf3._comment
 create mode 100644 doc/todo/compute_special_remote/comment_22_d1561153a3916411ed8caa92fa53893c._comment
 rename TODO-compute => doc/todo/compute_special_remote_remaining_todos.mdwn (82%)

diff --git a/CHANGELOG b/CHANGELOG
index 475277f8f4..8c944a4bfb 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -1,5 +1,9 @@
 git-annex (10.20250116) UNRELEASED; urgency=medium
 
+  * Added the compute special remote.
+  * addcomputed: New command, adds a file that is generated by a compute
+    special remote.
+  * recompute: New command, recomputes computed files.
   * Support help.autocorrect settings "prompt", "never", and "immediate".
   * Allow setting remote.foo.annex-tracking-branch to a branch name that
     contains "/", as long as it's not a remote tracking branch.
diff --git a/doc/todo/compute_special_remote/comment_21_2546562f7a00e082cd0500debc904cf3._comment b/doc/todo/compute_special_remote/comment_21_2546562f7a00e082cd0500debc904cf3._comment
new file mode 100644
index 0000000000..1416d77bde
--- /dev/null
+++ b/doc/todo/compute_special_remote/comment_21_2546562f7a00e082cd0500debc904cf3._comment
@@ -0,0 +1,22 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""Re: DataLad exploration of the compute on demand space"""
+ date="2025-03-06T17:39:04Z"
+ content="""
+Thanks for explaining the design points of datalad-remake. Some
+different design choices than I have made, but mostly they strike me as
+implementing what is easier/possible from outside git-annex.
+
+Eg, storing the compute inputs under `.datalad` in the branch is fine --
+and might even be useful if you want to make a branch that changes
+something in there -- but of course in the git-annex implementation it
+stores the equivalent thing in the git-annex branch.
+
+I do hope I'm not closing off the design space from such differences
+by dropping a compute special remote right into git-annex. But I also
+expect that having a standard and easy way for at least simple
+computations will lead to a lot of contributions as others use it.
+
+Your fMRI case seems like one that my compute remote could handle well
+and easily.
+"""]]
diff --git a/doc/todo/compute_special_remote/comment_22_d1561153a3916411ed8caa92fa53893c._comment b/doc/todo/compute_special_remote/comment_22_d1561153a3916411ed8caa92fa53893c._comment
new file mode 100644
index 0000000000..bfacbdf57d
--- /dev/null
+++ b/doc/todo/compute_special_remote/comment_22_d1561153a3916411ed8caa92fa53893c._comment
@@ -0,0 +1,69 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 22"""
+ date="2025-03-06T17:54:50Z"
+ content="""
+I've merged the compute special remote now.
+See [[special_remotes/compute]], [[git-annex-addcomputed]]
+and [[git-annex-recompute]].
+
+I have opened [[todo/compute_special_remote_remaining_todos]] with
+some various ways that I want to improve it further. Including, notably,
+computing on inputs from submodules, which is not currently supported at
+all.
+
+----
+
+Here I'll go down mih's original and quite useful design criteria and see
+how the compute special remote applies to them:
+
+### Generate annex keys (that have never existed)
+
+`git-annex addcomputed --fast`
+
+### Re-generate annex keys
+
+`git-annex addcomputed` optionally with the --reproducible option,
+followed by a later `git-annex get`
+
+Another thing that fits under this heading is when one of the original
+input files has gotten modified, and you want to compute a new version of
+the output file from it, using the same method as was used to compute it
+before. That's `git-annex recompute $output_file`
+
+### Worktree provisioning?
+
+This is the main thing I didn't implement. Given that git-annex is working
+with large files and needs to support various filesystems and OS's that
+lack hardlinks and softlinks, it's hard to do this inexpensively.
+
+Also, it turned out to make sense for the compute program to request
+the input files it needs, since this lets git-annex learn what the input
+files are, so it can make them available when regenerating a computed file
+later. And so the protocol just has git-annex respond with the path to
+the content of the file.
+
+### Request one key, receive many
+
+This is supported. (So is using multiple inputs to produce one (or more)
+outputs.)
+
+### Instruction deposition
+
+`git-annex addcomputed`
+
+### Storage redundancy tests
+
+It did make sense to have it automatically `git-annex get` the inputs.
+Well, I think it makes sense in most cases, this may become a tunable
+setting of the compute special remote.
+
+### Trust
+
+Handled by requiring the user install a `git-annex-compute-foo` command
+in PATH, and provide the name of the command to `initremote`.
+
+And for later `enableremote` or `autoenable=true`, it will only
+allow programs that are listed in the annex.security.allowed-compute-programs
+git config.
+"""]]
diff --git a/TODO-compute b/doc/todo/compute_special_remote_remaining_todos.mdwn
similarity index 82%
rename from TODO-compute
rename to doc/todo/compute_special_remote_remaining_todos.mdwn
index 7749ad1be3..bb522398a4 100644
--- a/TODO-compute
+++ b/doc/todo/compute_special_remote_remaining_todos.mdwn
@@ -1,3 +1,19 @@
+This is the remainder of my todo list while I was building the
+compute special remote. --[[Joey]]
+
+* write a tip showing how to use this
+
+* Write some simple compute programs so we have something to start with.
+
+  - convert between images eg jpeg to png
+  - run a command in a singularity container (that is one of the inputs)
+  - run a wasm binary (that is one of the inputs)
+
+* compute on input files in submodules
+
+* annex.diskreserve can be violated if getting a file computes it but also
+  some other output files, which get added to the annex.
+
 * would be nice to have a way to see what computations are used by a
   compute remote for a file. Put it in `whereis` output? But it's not an
   url. Maybe a separate command? That would also allow querying for eg,
@@ -27,8 +43,6 @@
   So it, seems that, for this to be done, recompute would need to stage the
   pointer file.
 
-* compute on files in submodules
-
 * recompute could ingest keys for other files than the one being
   recomputed, and remember them. Then recomputing those files could just
   use those keys, without re-running a computation. (Better than --others