Merge branch 'compute'

Joey Hess 2025-03-06 14:23:58 -04:00
commit 6f78341fbf
GPG key ID: DB12DB0FF05F8F38
47 changed files with 1771 additions and 161 deletions


@ -0,0 +1,22 @@
[[!comment format=mdwn
username="joey"
subject="""Re: DataLad exploration of the compute on demand space"""
date="2025-03-06T17:39:04Z"
content="""
Thanks for explaining the design points of datalad-remake. Some
different design choices than I have made, but mostly they strike me as
implementing what is easier/possible from outside git-annex.
Eg, storing the compute inputs under `.datalad` in the branch is fine --
and might even be useful if you want to make a branch that changes
something in there -- but of course in the git-annex implementation it
stores the equivalent thing in the git-annex branch.
I do hope I'm not closing off the design space from such differences
by dropping a compute special remote right into git-annex. But I also
expect that having a standard and easy way for at least simple
computations will lead to a lot of contributions as others use it.
Your fMRI case seems like one that my compute remote could handle well
and easily.
"""]]


@ -0,0 +1,69 @@
[[!comment format=mdwn
username="joey"
subject="""comment 22"""
date="2025-03-06T17:54:50Z"
content="""
I've merged the compute special remote now.
See [[special_remotes/compute]], [[git-annex-addcomputed]]
and [[git-annex-recompute]].
I have opened [[todo/compute_special_remote_remaining_todos]] with
various ways that I want to improve it further, notably including
computing on inputs from submodules, which is not currently supported at
all.
----
Here I'll go down mih's original and quite useful design criteria and see
how the compute special remote applies to them:
### Generate annex keys (that have never existed)
`git-annex addcomputed --fast`
### Re-generate annex keys
`git-annex addcomputed`, optionally with the `--reproducible` option,
followed by a later `git-annex get`
Another thing that fits under this heading is when one of the original
input files has gotten modified, and you want to compute a new version of
the output file from it, using the same method as was used to compute it
before. That's `git-annex recompute $output_file`
### Worktree provisioning?
This is the main thing I didn't implement. Given that git-annex is working
with large files and needs to support various filesystems and OSes that
lack hardlinks and softlinks, it's hard to do this inexpensively.
Also, it turned out to make sense for the compute program to request
the input files it needs, since this lets git-annex learn what the input
files are, so it can make them available when regenerating a computed file
later. And so the protocol just has git-annex respond with the path to
the content of the file.
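
The request-driven exchange described above can be modeled in a few lines of Python. This is only a toy sketch and not the actual protocol: the `("INPUT", name)` and `("OUTPUT", name)` message tuples, the `run_computation` driver and the `thumbnail_program` example are all made up for illustration. It shows the point made above: because the program asks for each input, the host learns the full list of inputs without provisioning a worktree.

```python
# Toy model of a compute program that asks its host for inputs.
# Message shapes and all names here are hypothetical, not git-annex's.

def thumbnail_program():
    # Ask for an input; the host replies with the path to its content.
    photo_path = yield ("INPUT", "photo.raw")
    # A real program would read photo_path here and write the outputs.
    yield ("OUTPUT", "photo.jpeg")
    yield ("OUTPUT", "photo.thumb.jpeg")


def run_computation(program, content_paths):
    # Drive the program, answering INPUT requests from content_paths
    # (a dict mapping file name -> path of its content), and record
    # which inputs were requested and which outputs were declared.
    inputs, outputs = [], []
    gen = program()
    reply = None
    while True:
        try:
            kind, name = gen.send(reply)
        except StopIteration:
            break
        if kind == "INPUT":
            inputs.append(name)
            reply = content_paths[name]
        elif kind == "OUTPUT":
            outputs.append(name)
            reply = None
        else:
            raise ValueError("unknown request: " + kind)
    return inputs, outputs
```

Calling `run_computation(thumbnail_program, {"photo.raw": "/some/object/path"})` returns the recorded input list and the two declared outputs, which is the information needed to regenerate the outputs later.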
### Request one key, receive many
This is supported. (So is using multiple inputs to produce one (or more)
outputs.)
### Instruction deposition
`git-annex addcomputed`
### Storage redundancy tests
It did make sense to have it automatically `git-annex get` the inputs.
Well, I think it makes sense in most cases; this may become a tunable
setting of the compute special remote.
### Trust
Handled by requiring the user to install a `git-annex-compute-foo` command
in PATH, and provide the name of the command to `initremote`.
And for later `enableremote` or `autoenable=true`, it will only
allow programs that are listed in the `annex.security.allowed-compute-programs`
git config.
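
The trust rule can be sketched as a small predicate. This is a toy model, not git-annex's code, and it rests on two assumptions that are not stated above: that the config value is a whitespace-separated list of program names, and that `initremote` insists on the `git-annex-compute-` name prefix. The function name `may_enable` is made up.

```python
# Toy model of the enable-time check; not git-annex's actual code.
# Assumption: the config value is a whitespace-separated list of names.

def may_enable(program, allowed_config, explicit_initremote=False):
    if explicit_initremote:
        # initremote: the user named the program themselves.
        # Assumption: only the git-annex-compute- prefix is checked here.
        return program.startswith("git-annex-compute-")
    # enableremote / autoenable=true: only programs on the allowlist.
    return program in allowed_config.split()
```

With an empty allowlist, `may_enable("git-annex-compute-foo", "")` is false, so a remote that arrives via a clone cannot cause the program to run until the user lists it.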
"""]]


@ -0,0 +1,69 @@
This is the remainder of my todo list while I was building the
compute special remote. --[[Joey]]
* write a tip showing how to use this
* Write some simple compute programs so we have something to start with.
  - convert between images eg jpeg to png
  - run a command in a singularity container (that is one of the inputs)
  - run a wasm binary (that is one of the inputs)
* compute on input files in submodules
* annex.diskreserve can be violated if getting a file computes it but also
some other output files, which get added to the annex.
* would be nice to have a way to see what computations are used by a
compute remote for a file. Put it in `whereis` output? But it's not an
url. Maybe a separate command? That would also allow querying for eg,
what files are inputs for another file. Or it could be exposed in the
Remote interface, and made into a file matching option.
* "getting input from <file>" message uses the original filename,
but that file might have been renamed. Would be more clear to use
whatever file in the tree currently points to the key it's getting
(what if there is not one?)
* allow git-annex enableremote with program= explicitly specified,
without checking annex.security.allowed-compute-programs
* addcomputed should honor annex.addunlocked.
What about recompute? It seems it should either write the new version of
the file as an unlocked file when the old version was unlocked, or also
honor annex.addunlocked.
Problem: Since recompute does not stage the file, it would have to write
the content to the working tree. And then the user would need to
git-annex add. But then, if the key was a VURL key, it would add it with
the default backend instead, and the file would no longer use a computed
key.
So it seems that, for this to be done, recompute would need to stage the
pointer file.
* recompute could ingest keys for other files than the one being
recomputed, and remember them. Then recomputing those files could just
use those keys, without re-running a computation. (Better than --others
which got removed.)
* `git-annex recompute foo bar baz`, when foo depends on bar which depends
on baz, and when baz has changed, will not recompute foo, because bar has
not changed. It then recomputes bar. So running the command again is
needed to recompute foo.
What it could do is, after it recomputes bar, notice that it already
considered foo, and revisit foo, and recompute it then. It could either
use a bloom filter to remember the files it considered but did not
compute, or it could just notice that the command line includes foo
(or includes a directory that contains foo), and that foo is not
modified.
Or it could build a DAG and traverse it, but building a DAG of a large
directory tree has its own problems.
* Should addcomputed honor annex.smallfiles? That would seem to imply
that recompute should also support recomputing non-annexed files.
Otherwise, adding a file and then recomputing it would vary in
what the content of the file is, depending on the annex.smallfiles setting.
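
The ordering problem in the `git-annex recompute foo bar baz` item, and the "revisit" fix proposed there, can be illustrated with a toy model. This is a sketch under simplifying assumptions, not how recompute is implemented: `deps` is a hypothetical dict from each computed file to its inputs, a plain list stands in for the bloom filter, and "recomputing" just marks the file as changed.

```python
# Toy model of the ordering problem; not git-annex's implementation.
# deps maps each computed file to the input files it was computed from.

def single_pass(files, deps, changed):
    # Behavior as described in the item: each file is considered once,
    # in command-line order.
    changed = set(changed)
    recomputed = []
    for f in files:
        if any(i in changed for i in deps.get(f, [])):
            recomputed.append(f)
            changed.add(f)
    return recomputed


def with_revisit(files, deps, changed):
    # Proposed behavior: remember files that were considered but skipped,
    # and revisit them when one of their inputs gets recomputed.
    changed = set(changed)
    recomputed = []
    skipped = []  # kept in the order the files were considered
    queue = list(files)
    while queue:
        f = queue.pop(0)
        if any(i in changed for i in deps.get(f, [])):
            recomputed.append(f)
            changed.add(f)
            # Revisit earlier skips that depend on the file just recomputed.
            again = [s for s in skipped if f in deps.get(s, [])]
            skipped = [s for s in skipped if s not in again]
            queue = again + queue
        else:
            skipped.append(f)
    return recomputed
```

With `deps = {"foo": ["bar"], "bar": ["baz"]}` and only baz changed, `single_pass(["foo", "bar", "baz"], deps, {"baz"})` recomputes just bar, while `with_revisit` recomputes bar and then foo in one run. The sketch only revisits files that were skipped, so a file that was already recomputed before one of its other inputs changed would still need a second run; the DAG approach would avoid that at the cost of building the graph.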