preparing to merge compute

Joey Hess 2025-03-06 14:22:45 -04:00
parent 4979df54d5
commit e952753846
GPG key ID: DB12DB0FF05F8F38
4 changed files with 111 additions and 2 deletions

@@ -1,5 +1,9 @@
git-annex (10.20250116) UNRELEASED; urgency=medium

  * Added the compute special remote.
  * addcomputed: New command, adds a file that is generated by a compute
    special remote.
  * recompute: New command, recomputes computed files.
  * Support help.autocorrect settings "prompt", "never", and "immediate".
  * Allow setting remote.foo.annex-tracking-branch to a branch name
    that contains "/", as long as it's not a remote tracking branch.

@@ -0,0 +1,22 @@
[[!comment format=mdwn
username="joey"
subject="""Re: DataLad exploration of the compute on demand space"""
date="2025-03-06T17:39:04Z"
content="""
Thanks for explaining the design points of datalad-remake. It makes some
different design choices than I have, but mostly they strike me as
implementing what is easier or possible from outside git-annex.

E.g., storing the compute inputs under `.datalad` in the branch is fine --
and might even be useful if you want to make a branch that changes
something in there -- but of course the git-annex implementation
stores the equivalent thing in the git-annex branch.

I do hope I'm not closing off the design space for such differences
by dropping a compute special remote right into git-annex. But I also
expect that having a standard and easy way to do at least simple
computations will lead to a lot of contributions as others use it.

Your fMRI case seems like one that my compute remote could handle well
and easily.
"""]]

@@ -0,0 +1,69 @@
[[!comment format=mdwn
username="joey"
subject="""comment 22"""
date="2025-03-06T17:54:50Z"
content="""
I've merged the compute special remote now.
See [[special_remotes/compute]], [[git-annex-addcomputed]]
and [[git-annex-recompute]].

I have opened [[todo/compute_special_remote_remaining_todos]] with
various ways that I want to improve it further. Notably, that includes
computing on inputs from submodules, which is not currently supported at
all.
----
Here I'll go down mih's original and quite useful design criteria and see
how the compute special remote applies to them:
### Generate annex keys (that have never existed)
`git-annex addcomputed --fast`
### Re-generate annex keys
`git-annex addcomputed`, optionally with the `--reproducible` option,
followed by a later `git-annex get`.

Another thing that fits under this heading is when one of the original
input files has been modified, and you want to compute a new version of
the output file from it, using the same method that was used to compute it
before. That's `git-annex recompute $output_file`.
### Worktree provisioning?
This is the main thing I didn't implement. Given that git-annex works
with large files and needs to support various filesystems and OSes that
lack hardlinks and softlinks, it's hard to do this inexpensively.

Also, it turned out to make sense for the compute program to request
the input files it needs, since this lets git-annex learn what the input
files are, so it can make them available when regenerating a computed file
later. And so the protocol just has git-annex respond with the path to
the content of the file.
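
A rough sketch of that exchange, seen from the compute program's side. This is only an illustration under assumptions: the `INPUT <file>` request line and the one-line path reply are assumed message shapes, not something stated in this comment, and `request_input` is a hypothetical helper. See [[special_remotes/compute]] for the actual protocol.

```python
import io
import sys


def request_input(name, out=sys.stdout, inp=sys.stdin):
    # Ask for an input file by name. The "INPUT" keyword is an assumed
    # message shape for this sketch, not a confirmed part of the protocol.
    out.write(f"INPUT {name}\n")
    out.flush()
    # The reply is assumed to be a single line holding the path to the
    # content of the requested file.
    reply = inp.readline()
    if not reply:
        raise EOFError(f"no reply for input {name!r}")
    return reply.rstrip("\n")


if __name__ == "__main__":
    # Simulate both sides of the exchange with in-memory streams.
    to_annex = io.StringIO()
    from_annex = io.StringIO("/path/to/content/of/foo.raw\n")
    path = request_input("foo.raw", out=to_annex, inp=from_annex)
    print(to_annex.getvalue().strip(), "->", path)
```

Because every input goes through a request like this, the set of requested names is exactly the list of inputs that has to be made available again when the file is regenerated later.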
### Request one key, receive many
This is supported. (So is using multiple inputs to produce one (or more)
outputs.)
### Instruction deposition
`git-annex addcomputed`
### Storage redundancy tests
It did make sense to have it automatically `git-annex get` the inputs.
Well, I think it makes sense in most cases; this may become a tunable
setting of the compute special remote.
### Trust
Handled by requiring the user to install a `git-annex-compute-foo` command
in PATH, and to provide the name of the command to `initremote`.

And for a later `enableremote` or `autoenable=true`, it will only
allow programs that are listed in the `annex.security.allowed-compute-programs`
git config.
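
The allowlist check can be pictured roughly like this. It is a sketch under assumptions, not git-annex's own code: `is_allowed` is a hypothetical helper, and treating the config value as a whitespace-separated list of program names is an assumption made for illustration.

```python
def is_allowed(program, allowed_config):
    # Only commands that follow the git-annex-compute-foo naming are
    # considered at all.
    if not program.startswith("git-annex-compute-"):
        return False
    # Assumption: the config value is a whitespace-separated list of names.
    return program in allowed_config.split()
```

For example, `is_allowed("git-annex-compute-foo", "git-annex-compute-foo git-annex-compute-bar")` is true, while a program that is missing from the list, or that lacks the `git-annex-compute-` prefix, is rejected.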
"""]]

@@ -1,3 +1,19 @@
This is the remainder of my todo list while I was building the
compute special remote. --[[Joey]]
* write a tip showing how to use this

* Write some simple compute programs so we have something to start with.

  - convert between images eg jpeg to png
  - run a command in a singularity container (that is one of the inputs)
  - run a wasm binary (that is one of the inputs)

* compute on input files in submodules

* annex.diskreserve can be violated if getting a file computes it but also
  some other output files, which get added to the annex.
* would be nice to have a way to see what computations are used by a
compute remote for a file. Put it in `whereis` output? But it's not a
URL. Maybe a separate command? That would also allow querying for eg,
@@ -27,8 +43,6 @@
So it seems that, for this to be done, recompute would need to stage the
pointer file.
* compute on files in submodules
* recompute could ingest keys for other files than the one being
recomputed, and remember them. Then recomputing those files could just
use those keys, without re-running a computation. (Better than --others