Merge branch 'master' of ssh://git-annex.branchable.com

This commit is contained in:
Joey Hess 2024-04-17 15:27:09 -04:00
commit d55e3f5fe2
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
13 changed files with 404 additions and 1 deletions

@@ -0,0 +1,62 @@
# Enable git-annex to provision file content by other means than download
This idea [goes back many years](https://github.com/datalad/datalad/issues/2850), and has been [iterated on repeatedly afterwards](https://github.com/datalad/datalad-next/issues/143), and most recently at Distribits 2024.
The following is a summary of what role git-annex could play in this functionality.
The basic idea is to wrap a provision-by-compute process into the standard interface of a git annex remote.
A consumer would (theoretically) not need to worry about how an annex key is provided; they would simply `git annex get` it, whether that leads to a download or a computation.
Moreover, access cost and redundancies could be managed/communicated using established patterns.
## Use cases
Here are a few concrete use cases that illustrate why one would want functionality like this:
### Generate annex keys (that have never existed)
This can be useful for leaving instructions for how, e.g., other data formats can be generated from a single format that is kept on storage.
For example, a collection of CSV files is stored, but an XLSX variant can be generated upon request automatically.
Or a single large live stream video is stored, and a collection of shorter clips is generated from a cue sheet or cut list.
### Re-generate annex keys
This can be useful when storing a key is expensive, but its exact identity is known/important. For example, the outcome of a scientific computation is a large output that is expensive both to compute and to store, yet needs to be tracked for repeated further processing. The cost of recomputation can be reduced by storing (smaller) intermediate results and leaving instructions for how to perform (a different) computation that yields the identical original output.
This second scenario, where annex keys are reproduced exactly, can be considered the general case. It generally requires exact inputs to the computation, whereas the first scenario can/should handle applying a compute instruction to any compatible input data.
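The exact-reproduction case can be checked mechanically: a checksum-backed annex key encodes the content's size and digest, so a recomputed output can be verified against the key it is supposed to reproduce. A minimal sketch for MD5 keys (the `MD5-s<size>--<digest>` shape; other backends differ):

```python
import hashlib

def md5_key(content: bytes) -> str:
    """Build a git-annex-style MD5 key (MD5-s<size>--<hexdigest>) for content."""
    return f"MD5-s{len(content)}--{hashlib.md5(content).hexdigest()}"

def verify_recomputation(expected_key: str, recomputed: bytes) -> bool:
    """Check that a recomputed output reproduces the exact annex key."""
    return md5_key(recomputed) == expected_key
```

If verification fails, the computation did not reproduce the original bit-exactly, and the remote should report a transfer failure rather than hand back different content under the same key.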
## What is in scope for git-annex?
The execution of arbitrary code without any inherent trust is a problem, and one that git-annex may not want to get into. Moreover, there are many candidate environments for code execution -- a complexity that git-annex may likewise not want to take on.
### External remote protocol sufficient?
From my point of view, pretty much all of the required functionality could be hidden behind the external remote protocol, and thereby inside one or more special remote implementations:

- `STORE`: somehow capture the computing instructions, likely linking some code to some (key-specific) parameters, like input files
- `CHECKPRESENT`: do compute instructions for a key exist?
- `RETRIEVE`: compute the key
- `REMOVE`: remove the instructions/parameter record
- `WHEREIS`: give information on computation/inputs

where `SETSTATE`/`GETSTATE` may implement the instruction deposit/retrieval.
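As a toy model of the mapping above (not the real protocol plumbing, which speaks line-based messages such as `TRANSFER STORE`/`TRANSFER RETRIEVE` over stdin/stdout), the remote's "storage" would be a table of compute-instruction records rather than file content:

```python
class ComputeRemote:
    """Toy model of a special remote backed by compute-instruction records."""

    def __init__(self):
        self.records = {}  # key -> instruction record (cf. SETSTATE/GETSTATE)

    def store(self, key, instructions):
        # STORE: capture the computing instructions for this key
        self.records[key] = instructions
        return f"TRANSFER-SUCCESS STORE {key}"

    def checkpresent(self, key):
        # CHECKPRESENT: do compute instructions for this key exist?
        status = "SUCCESS" if key in self.records else "FAILURE"
        return f"CHECKPRESENT-{status} {key}"

    def retrieve(self, key, run):
        # RETRIEVE: compute the key by running its stored instructions
        if key not in self.records:
            return f"TRANSFER-FAILURE RETRIEVE {key}"
        run(self.records[key])
        return f"TRANSFER-SUCCESS RETRIEVE {key}"

    def remove(self, key):
        # REMOVE: drop the instruction/parameter record
        self.records.pop(key, None)
        return f"REMOVE-SUCCESS {key}"

    def whereis(self, key):
        # WHEREIS: report the computation instead of a storage location
        return f"WHEREIS-SUCCESS computed by: {self.records.get(key, '<unknown>')}"
```

The point of the sketch is that nothing here requires protocol changes: "presence" simply means "instructions on file", and "retrieval" means "run them".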
### Worktree provisioning?
Such an external remote implementation would need a way to create suitable worktrees in which to (re)run a given computation. Git-annex support for providing a (separate) worktree of the repository at a specific commit, with efficient (re)use of the main repository's annex, would simplify such implementations.
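Plain git already has most of the needed mechanics in `git worktree`; what is missing is the annex-aware part (efficiently sharing annexed content with the main repository). A sketch of just the git side, using a hypothetical `provision_worktree` helper:

```python
import subprocess

def provision_worktree(repo: str, commit: str, dest: str) -> str:
    """Sketch: check out `commit` into a separate worktree at `dest`,
    sharing the repository's object store. For git-annex, the missing
    piece is making the annex content available in that worktree too."""
    subprocess.run(
        ["git", "-C", repo, "worktree", "add", "--detach", dest, commit],
        check=True,
    )
    return dest
```

A remote could provision such a worktree for the commit recorded in a compute instruction, run the computation there, and remove the worktree afterwards (`git worktree remove`).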
### Request one key, receive many
It is possible that a single computation yields multiple annex keys, even when git-annex only asked for a single one (which is what it would do, sequentially, when driving a special remote). It would be good to be able to capture that and avoid needless duplication of computations.
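One way to capture this on the remote side is to memoize whole computation runs, so that when git-annex asks for the sibling keys one by one, only the first request actually computes. A sketch (names are illustrative):

```python
def make_cached_retriever(compute):
    """compute(instructions) returns {key: content} and may cover several
    keys at once; cache every produced key so sibling requests from the
    same run do not recompute."""
    cache = {}
    runs = []  # record of actual computations, for illustration

    def retrieve(key, instructions):
        if key not in cache:
            runs.append(instructions)
            cache.update(compute(instructions))
        return cache[key]

    return retrieve, runs
```

This only avoids duplicate work within one remote process; a protocol-level "request one key, receive many" would let git-annex ingest the sibling keys directly.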
### Instruction deposition
Using `STORE` (`git annex copy --to`) to record instructions is possible (say, particular environment variables pass information to a special remote), but that is more or less a hack. It would be useful to have a dedicated command that accepts such a record and deposits it in association with one or more annex keys (which may or may not be known at that time). This would likely require settling on a format for such records.
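Whatever format is eventually settled on, the record would need to tie an instruction set to its inputs and to the key(s) it can produce. A hypothetical JSON shape (all field names made up for illustration):

```python
import json

def make_record(command, inputs, outputs):
    """Serialize a hypothetical compute-instruction record; one record may
    be associated with several annex keys (see 'request one, receive many')."""
    return json.dumps({
        "version": 1,         # format version, to allow later migration
        "command": command,   # what to run
        "inputs": inputs,     # annex keys / paths the computation consumes
        "outputs": outputs,   # annex keys the computation can produce
    }, sort_keys=True)

def parse_record(blob):
    record = json.loads(blob)
    assert record["version"] == 1
    return record
```

Keeping one shared record per instruction set (rather than per key) is what makes later maintenance updates a single edit.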
### Storage redundancy tests
I believe that no particular handling is needed for annex keys that are declared inputs to the computing instructions for other keys. Still listing it here to state that, and possibly to be proven wrong.
### Trust
We would need a way for users to indicate that they trust a particular compute instruction, or the entity that provided it. Even if git-annex does not implement tooling for that, it would be good to settle on a concept that can be interpreted/implemented by such special remotes.

@@ -0,0 +1,15 @@
[[!comment format=mdwn
username="m.risse@77eac2c22d673d5f10305c0bade738ad74055f92"
nickname="m.risse"
avatar="http://cdn.libravatar.org/avatar/59541f50d845e5f81aff06e88a38b9de"
subject="prior art"
date="2024-04-13T20:30:56Z"
content="""
I just want to mention that I've implemented/tried to implement something like this in <https://github.com/matrss/datalad-getexec>. It basically just records a command line invocation to execute, plus all required input files, as base64-encoded JSON in a URL with a custom scheme, which made it surprisingly simple to implement. I haven't touched it in a while and it was more of an experiment, but other than issues with dependencies on files in sub-datasets it worked pretty well. The main motivation to build it was the mentioned use case of automatically converting between file formats. Of course it doesn't address all of your mentioned points; e.g. trust is something I haven't considered in my experiments at all. But it shows that the current special remote protocol is sufficient for a basic implementation of this.

I like the proposed \"request one key, receive many\" extension to the special remote protocol, and I think it could be useful in other \"unusual\" special remotes as well.

I don't quite understand the necessity for \"Worktree provisioning\". If I understand it right, I think it would just make things more complicated and unintuitive compared to always staying in HEAD.

\"Instruction deposition\" is essentially just adding a URL to a key in my implementation, which is pretty nice. Using the built-in relaxed option automatically gives the distinction between generating keys that have never existed and regenerating keys.
"""]]

@@ -0,0 +1,18 @@
[[!comment format=mdwn
username="mih"
avatar="http://cdn.libravatar.org/avatar/f881df265a423e4f24eff27c623148fd"
subject="Need for more than HEAD/URL?"
date="2024-04-15T05:00:58Z"
content="""
> \"Instruction deposition\" is essentially just adding a URL to a key in my implementation, which is pretty nice. Using the built-in relaxed option automatically gives the distinction between generating keys that have never existed and regenerating keys.
Thanks for the pointer, very useful!
Regarding the points you raised:
Datalad's `run` feature has been around for some years, and we have seen usage in the wild with command lines that are small programs and dozens, sometimes hundreds of inputs. It is true that anything could be simply URL-encoded. However, especially with command-patterns (always same, except parameter change) that may be needlessly heavy. Maybe it would compress well (likely), but it still poses a maintenance issue. Say the compute instructions need an update (software API change): Updating one shared instruction set is a simpler task than sifting through annex-keys and rewriting URLs.
> I don't quite understand the necessity for \"Worktree provisioning\". If I understand that right, I think it would just make things more complicated and unintuitive compared to always staying in HEAD.
We need a worktree different from `HEAD` whenever HEAD has changed from the original worktree used for setting up a compute instruction. Say a command needs two input files, but one has been moved to a different directory in current `HEAD`. An implementation would now either say \"no longer available\" and force maintenance update, or be able to provision the respective worktree. In case of no provision capability we would need to replace the URL-encoded instructions (this would make the key uncomputable in earlier versions), or amend with an additional instruction set (and now we would start to accumulate cruft where changes in the git-annex branch need to account for (unrelated) changes in any other branch).
"""]]

@@ -0,0 +1,14 @@
In our case we are storing videos using a timestamp in the filename, e.g.
```
2024.03.08.09.31.09.041_2024.03.08.09.34.53.759.mkv
```
where the last number is milliseconds. For MD5E, `git-annex` decides that the extension is `.759.mkv`, so if we rename the file (adjust the timing), it produces a new key.

I wonder if you have any ideas, Joey, on how to overcome this (smarter extension deduction? some config to "hardcode" the target extension to be `.mkv`?).

Just throwing this against the wall to see if it sticks.
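One possible "smarter extension deduction" heuristic would be to skip purely numeric dotted parts when grabbing the extension. A sketch of that idea (not git-annex's actual MD5E logic, which is governed by settings such as `annex.maxextensionlength`):

```python
def deduce_extension(name: str, max_parts: int = 2, max_len: int = 4) -> str:
    """Grab up to `max_parts` short dotted extensions from the end of a
    filename, but stop at purely numeric parts (e.g. '.759' in the
    timestamped name above), so '.759.mkv' becomes just '.mkv'."""
    parts = name.split(".")[1:]  # everything after the first dot
    exts = []
    for p in reversed(parts):
        if len(exts) >= max_parts or len(p) > max_len or not p:
            break
        if p.isdigit():
            break  # numeric component: treat as part of the name, stop
        exts.append(p)
    return "".join("." + p for p in reversed(exts))
```

On the example filename this yields `.mkv`, while a name like `archive.tar.gz` still keeps both parts.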
[[!meta author=yoh]]
[[!tag projects/repronim]]