Merge branch 'compute'

2025-03-06 14:23:58 -04:00 · 2025-03-06 14:23:58 -04:00 · 6f78341fbf
commit 6f78341fbf
parent 203a730e28 e952753846
47 changed files with 1771 additions and 161 deletions
--- a/doc/todo/compute_special_remote/comment_21_2546562f7a00e082cd0500debc904cf3._comment
+++ b/doc/todo/compute_special_remote/comment_21_2546562f7a00e082cd0500debc904cf3._comment
@@ -0,0 +1,22 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""Re: DataLad exploration of the compute on demand space"""
+ date="2025-03-06T17:39:04Z"
+ content="""
+Thanks for explaining the design points of datalad-remake. Some
+different design choices than I have made, but mostly they strike me as
+implementing what is easier/possible from outside git-annex. 
+
+Eg, storing the compute inputs under `.datalad` in the branch is fine --
+and might even be useful if you want to make a branch that changes
+something in there -- but of course in the git-annex implementation it
+stores the equivalent thing in the git-annex branch.
+
+I do hope I'm not closing off the design space from such differences
+by dropping a compute special remote right into git-annex. But I also
+expect that having a standard and easy way for at least simple
+computations will lead to a lot of contributions as others use it.
+
+Your fMRI case seems like one that my compute remote could handle well
+and easily.
+"""]]
--- a/doc/todo/compute_special_remote/comment_22_d1561153a3916411ed8caa92fa53893c._comment
+++ b/doc/todo/compute_special_remote/comment_22_d1561153a3916411ed8caa92fa53893c._comment
@@ -0,0 +1,69 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 22"""
+ date="2025-03-06T17:54:50Z"
+ content="""
+I've merged the compute special remote now.
+See [[special_remotes/compute]], [[git-annex-addcomputed]]
+and [[git-annex-recompute]].
+
+I have opened [[todo/compute_special_remote_remaining_todos]] with
+some various ways that I want to improve it further. Including, notably,
+computing on inputs from submodules, which is not currently supported at
+all.
+
+----
+
+Here I'll go down mih's original and quite useful design criteria and see
+how the compute special remote applies to them:
+
+### Generate annex keys (that have never existed)
+
+`git-annex addcomputed --fast`
+
+### Re-generate annex keys
+
+`git-annex addcomputed` optionally with the `--reproducible` option,
+followed by a later `git-annex get`
+
+Another thing that fits under this heading is when one of the original
+input files has gotten modified, and you want to compute a new version of
+the output file from it, using the same method as was used to compute it
+before. That's `git-annex recompute $output_file`
+
+### Worktree provisioning?
+
+This is the main thing I didn't implement. Given that git-annex is working
+with large files and needs to support various filesystems and OS's that
+lack hardlinks and softlinks, it's hard to do this inexpensively.
+
+Also, it turned out to make sense for the compute program to request
+the input files it needs, since this lets git-annex learn what the input
+files are, so it can make them available when regenerating a computed file
+later. And so the protocol just has git-annex respond with the path to
+the content of the file.
+
+### Request one key, receive many
+
+This is supported. (So is using multiple inputs to produce one (or more)
+outputs.)
+
+### Instruction deposition
+
+`git-annex addcomputed`
+
+### Storage redundancy tests
+
+It did make sense to have it automatically `git-annex get` the inputs.
+Well, I think it makes sense in most cases; this may become a tunable
+setting of the compute special remote.
+
+### Trust
+
+Handled by requiring the user to install a `git-annex-compute-foo` command
+in PATH, and provide the name of the command to `initremote`.
+
+And for later `enableremote` or `autoenable=true`, it will only
+allow programs that are listed in the annex.security.allowed-compute-programs
+git config.
+"""]]
--- a/doc/todo/compute_special_remote_remaining_todos.mdwn
+++ b/doc/todo/compute_special_remote_remaining_todos.mdwn
@@ -0,0 +1,69 @@
+This is the remainder of my todo list while I was building the
+compute special remote. --[[Joey]]
+
+* write a tip showing how to use this
+
+* Write some simple compute programs so we have something to start with.
+
+  - convert between images eg jpeg to png
+  - run a command in a singularity container (that is one of the inputs)
+  - run a wasm binary (that is one of the inputs)
+
+* compute on input files in submodules
+
+* annex.diskreserve can be violated if getting a file computes it but also
+  some other output files, which get added to the annex.
+
+* would be nice to have a way to see what computations are used by a
+  compute remote for a file. Put it in `whereis` output? But it's not a
+  url. Maybe a separate command? That would also allow querying for eg,
+  what files are inputs for another file. Or it could be exposed in the
+  Remote interface, and made into a file matching option.
+
+* "getting input from <file>" message uses the original filename,
+  but that file might have been renamed. Would be more clear to use
+  whatever file in the tree currently points to the key it's getting
+  (what if there is not one?)
+
+* allow git-annex enableremote with program= explicitly specified,
+  without checking annex.security.allowed-compute-programs
+
+* addcomputed should honor annex.addunlocked.
+
+  What about recompute? It seems it should either write the new version of
+  the file as an unlocked file when the old version was unlocked, or also
+  honor annex.addunlocked.
+  
+  Problem: Since recompute does not stage the file, it would have to write
+  the content to the working tree. And then the user would need to
+  git-annex add. But then, if the key was a VURL key, it would add it with
+  the default backend instead, and the file would no longer use a computed
+  key. 
+
+  So it seems that, for this to be done, recompute would need to stage the
+  pointer file.
+
+* recompute could ingest keys for other files than the one being
+  recomputed, and remember them. Then recomputing those files could just
+  use those keys, without re-running a computation. (Better than --others
+  which got removed.)
+
+* `git-annex recompute foo bar baz`, when foo depends on bar which depends
+  on baz, and when baz has changed, will not recompute foo, because bar has
+  not changed. It then recomputes bar. So running the command again is
+  needed to recompute foo. 
+
+  What it could do is, after it recomputes bar, notice that it already
+  considered foo, and revisit foo, and recompute it then. It could either
+  use a bloom filter to remember the files it considered but did not
+  compute, or it could just notice that the command line includes foo
+  (or includes a directory that contains foo), and then foo is not
+  modified.
+
+  Or it could build a DAG and traverse it, but building a DAG of a large
+  directory tree has its own problems.
+
+* Should addcomputed honor annex.smallfiles? That would seem to imply
+  that recompute should also support recomputing non-annexed files.
+  Otherwise, adding a file and then recomputing it would vary in
+  what the content of the file is, depending on annex.smallfiles setting.
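
The `git-annex recompute foo bar baz` item in the todo list above weighs revisiting files against building a DAG. As a rough illustration of the DAG option only, here is a minimal Python sketch. It is not git-annex's implementation (git-annex is written in Haskell), and the `inputs_of` mapping and the `recompute_order` function are hypothetical names invented for this example. It assumes the input files of each computed file are already known, and it ignores the cost of building that mapping for a large tree, which is the concern the todo item raises.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def recompute_order(inputs_of, changed):
    """Return the computed files that need recomputing, inputs first.

    inputs_of: hypothetical mapping of computed file -> set of input files.
    changed:   set of files whose content has changed.
    """
    # static_order() yields every node after all of its predecessors,
    # so a file is only visited once its inputs have been considered.
    order = TopologicalSorter(inputs_of).static_order()
    stale = set(changed)
    result = []
    for f in order:
        # A computed file is stale if any of its inputs is stale.
        if f in inputs_of and stale & set(inputs_of[f]):
            stale.add(f)
            result.append(f)
    return result


# foo depends on bar, which depends on baz; baz has changed.
deps = {"foo": {"bar"}, "bar": {"baz"}}
print(recompute_order(deps, {"baz"}))  # ['bar', 'foo']
```

With this ordering, a single pass recomputes bar and then foo, so the command would not need to be run a second time.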