Merge /home/joey/tmp/git-annex into ospath
commit 917c43f31f
19 changed files with 509 additions and 16 deletions
[[!comment format=mdwn
username="joey"
subject="""comment 10"""
date="2025-01-28T14:06:41Z"
content="""
Using metadata to store the inputs of computations like I did in my example
above seems like it would allow the metadata to be changed later, which
would change the output when a key gets recomputed. That feels surprising,
because metadata could be changed for any reason, without the intention
of affecting a compute special remote.

It might be possible for git-annex to pin down the current state of
metadata (or the whole git-annex branch) and provide the same input to the
computation when it's run again. (Unless `git-annex forget` has caused
that old branch state to be lost..) But it can't fully isolate the program
from all unpinned inputs without using some form of containerization,
which feels out of scope for git-annex.

Instead of using metadata, the input values could be stored in the
per-special-remote state of the generated key. Or the input values could be
encoded in the key itself, but then two computations that generate the same
output would have two different keys, rather than hashing to the same key.

Using a key with a regular hash backend also lets the user find out if the
computation turns out not to be reproducible later for whatever reason;
getting the file from the compute special remote will fail at hash
verification time. Something like a VURL key could still alternatively be
used in cases where reproducibility is not important.

To add a computed file, the interface would look close to the same,
but now the --value options are setting fields in the compute special
remote's state:

    git-annex addcomputed foo --to ffmpeg-cut \
        --input source=input.mov \
        --value starttime=15:00 \
        --value endtime=30:00

The values could be provided to the "git-annex-compute-" program with
environment variables.
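
As a rough sketch of what that could look like (the `ANNEX_COMPUTE_*` variable names and the output path convention are assumptions, not a settled interface), a `git-annex-compute-ffmpeg-cut` program might be:

```shell
#!/bin/sh
# Hypothetical git-annex-compute-ffmpeg-cut. All variable names here
# are assumptions about an interface that does not exist yet.
set -eu

# Inputs and values, as git-annex would set them from the stored state.
source="${ANNEX_COMPUTE_source:-input.mov}"
starttime="${ANNEX_COMPUTE_starttime:-15:00}"
endtime="${ANNEX_COMPUTE_endtime:-30:00}"
output="${ANNEX_COMPUTE_output:-output.mov}"

# Only print the ffmpeg invocation, since this is an illustration;
# a real program would execute it.
echo ffmpeg -i "$source" -ss "$starttime" -to "$endtime" -c copy "$output"
```

The point of the environment-variable approach is that the program needs no option parsing at all, and git-annex stays in control of exactly which state reaches the computation.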

For `--input source=foo`, it could look up the git-annex key (or git sha1)
of that file, and store that in the state. So it would provide the compute
program with the same data every time. But it could *also* store the
filename. And that allows for a command like this:

    git-annex recompute foo --from ffmpeg-cut

Which, when the input.mov file has been changed, would re-run the
computation with the new content of the file, and stage a new version of
the computed file. It could even be used to recompute every file in a tree:

    git-annex recompute . --from ffmpeg-cut

Also, that command could let input values be adjusted later:

    git-annex recompute foo --from ffmpeg-cut --value starttime=14:50
    git commit -m 'include the introduction of the speaker in the clip'

It would also be good to have a command that examines a computed key
and displays the values and inputs. That could be `git-annex whereis`
or perhaps a dedicated command with more structured output:

    git-annex examinecompute foo --from ffmpeg-cut
    source=input.mov (annex key SHA256--xxxxxxxxx)
    starttime=15:00
    endtime=30:00

This all feels like it might allow for some useful workflows...
"""]]

[[!comment format=mdwn
username="joey"
subject="""Re: worktree provisioning"""
date="2025-01-28T14:08:29Z"
content="""
@m.risse in your example the "data.nc" file gets new content when
retrieved from the special remote and the source file has changed.

But if you already have the data.nc file present in a repository, it
does not get updated immediately when you update the source
"data.grib" file.

So, a drop and re-get of a file changes the version of the file you have
available. For that matter, if the old version has been stored on other
remotes, a get may retrieve either an old or a new version.
That is not intuitive, and it makes me wonder if using a
special remote is really a good fit for what you're wanting to do.

In your "cdo" example, it's not clear to me if the new version of the
software generates an identical file to the old, or if it has a bug fix
that causes it to generate a significantly different output. If the two
outputs are significantly different, then treating them as the same
git-annex key seems questionable to me.
"""]]

[[!comment format=mdwn
username="joey"
subject="""comment 12"""
date="2025-01-28T15:39:44Z"
content="""
My design so far does not fully support
"Request one key, receive many".

My `git-annex addcomputed` command doesn't handle the case where a
computation generates multiple output files. While the `git-annex-compute-`
command's interface could let it return several computed files, addcomputed
would only add one file, at the name that the user specifies. What is it
supposed to do if the computation generates more than one? Maybe it needs a
way to let a whole directory be populated with the files generated by a
computation. Or a way to specify multiple files to add.

And here's another problem:
Suppose I have one very expensive computation that generates files foo
and bar. And a second, less expensive computation, that also generates foo
(same content) as well as generating baz. Both computations are run on the
same compute special remote. Now if the user runs `git-annex get foo`,
they will be unhappy if it chooses to run the expensive computation,
rather than the less expensive computation.

Since the per-special-remote state for a key is used as the computation
input, only one input can be saved for foo's key. So it wouldn't really be
picking between two alternatives, it would just use whatever the current
state for that key is.
"""]]

subject="""comment 3"""
date="2024-04-30T19:31:35Z"
content="""
See also [[todo/wishlist:_derived_content_support]].
"""]]

subject="""comment 6"""
date="2024-04-30T19:53:43Z"
content="""
On trust, it seems to me that if someone chooses to install a
particular special remote, they are choosing to trust whatever kind of
computations it supports.

Eg a special remote could choose to always run a computation inside a
particular container system and then if you trust that container system is
secure, you can choose to install it.

Enabling the special remote is not necessary, because a
repository can be set to autoenable a special remote. In some sense this is
surprising. I had originally talked about enabling here, and then I
remembered autoenable.

It may be that autoenable should only be allowed for
special remote programs that the user explicitly whitelists, not merely
installs into PATH. That would break some existing workflows, though
setting some git configs would not be too hard.
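
For example (the config key name below is purely hypothetical, not an existing git-annex setting), such a whitelist could be expressed as ordinary git config:

```shell
# Hypothetical whitelist; the key name is an assumption, not a real
# git-annex config. Using --file keeps this sketch self-contained,
# where a real setup would use the global or repo config.
git config --file annex-whitelist.cfg --add annex.security.allowed-compute-programs ffmpeg-cut
git config --file annex-whitelist.cfg --add annex.security.allowed-compute-programs csv-to-xslx

# git-annex would then consult the list before autoenabling a remote:
git config --file annex-whitelist.cfg --get-all annex.security.allowed-compute-programs
```

A multi-valued key like this would let each allowed program be added with one `git config --add`, which keeps the "setting some git configs" burden small.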

There seems to be scope for both compute special remotes that execute code
that comes from the git repository, and ones that only have metadata about
the computation recorded in the git repository, in a way that cannot let
them execute arbitrary code under the control of the git repository.

A well-behaved compute special remote that does run code that comes from a
git repository could require an additional git config to be set to allow it
to do that.
"""]]

[[!comment format=mdwn
username="joey"
subject="""comment 9"""
date="2025-01-27T14:46:43Z"
content="""
Circling back to this, I think the fork in the road is whether this is
about git-annex providing this and that feature to support external special
remotes that compute, or whether git-annex gets a compute special
remote of its own with some simpler/better extension interface
than the external special remote protocol.

Of course, git-annex having its own compute special remote would not
preclude other external special remotes that compute. And for that matter,
a single external special remote could implement an extension interface.

---

Thinking about how a generic compute special remote in git-annex could
work, multiple instances of it could be initremoted:

    git-annex initremote convertfiles type=compute program=csv-to-xslx
    git-annex initremote cutvideo type=compute program=ffmpeg-cut

Here the "program" parameter would cause a program like
`git-annex-compute-ffmpeg-cut` to be run to get files from that instance
of the compute special remote. The interface could be as simple as it
being run with the key that it is requested to compute, and outputting
the paths to all the keys it was able to compute. (So allowing for
"request one key, receive many".) Perhaps also with some way to indicate
progress of the computation.
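
A minimal sketch of that interface (the `<key> <path>` output format and the scratch-directory handling are assumptions, not a real protocol): the program receives the requested key as its argument and prints one line per key it produced, so a single run can satisfy several keys.

```shell
#!/bin/sh
# Hypothetical git-annex-compute-<name> interface sketch; the
# "<key> <path>" output format is an assumption, not a real protocol.
set -eu

requested="${1:-KEY--example}"

# Run the computation into a scratch directory. A placeholder stands
# in for the real work (e.g. running ffmpeg).
tmpdir=$(mktemp -d)
printf 'computed content\n' > "$tmpdir/result"

# Report each computed key and the path to its content. A computation
# that yields several keys would print several such lines, giving
# "request one key, receive many".
echo "$requested $tmpdir/result"
```

git-annex would read those lines, verify and ingest each reported file, and could then treat any extra keys as already present rather than computing them again.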

It would make sense to store the details of computations in git-annex
metadata. And a compute program can use git-annex commands to get files
it depends on. Eg, `git-annex-compute-ffmpeg-cut` could run:

    # look up the configured metadata
    starttime=$(git-annex metadata --get compute-ffmpeg-starttime --key=$requested)
    endtime=$(git-annex metadata --get compute-ffmpeg-endtime --key=$requested)
    source=$(git-annex metadata --get compute-ffmpeg-source --key=$requested)

    # get the source video file, and find its path in the annex
    git-annex get --key=$source
    sourcefile=$(git-annex examinekey --format='${objectpath}' $source)

It might be worth formalizing that a given computed key can depend on other
keys, and have git-annex always get/compute those keys first. And provide
them to the program in a worktree?

When asked to store a key in the compute special remote, it would verify
that the key can be generated by it, using the same interface as is used
to get a key.

This all leaves a chicken and egg problem: how does the user add a computed
file if they don't know the key yet?

The user could manually run the commands that generate the computed file,
then `git-annex add` it, and set the metadata. Then `git-annex copy --to`
the compute remote would verify if the file can be generated, and add it if
so. This seems awkward, but also nice to be able to do manually.

Or, something like VURL keys could be used, with an interface something
like this:

    git-annex addcomputed foo --to ffmpeg-cut \
        --input compute-ffmpeg-source=input.mov \
        --set compute-ffmpeg-starttime=15:00 \
        --set compute-ffmpeg-endtime=30:00

All that would do is generate some arbitrary VURL key or similar,
provisionally set the provided metadata (how?), and try to store the key
in the compute special remote. If it succeeds, stage an annex pointer
and commit the metadata. Since it's a VURL key, storing the key in the
compute special remote would also record the hash of the generated file
at that point.
"""]]
@ -0,0 +1,14 @@
|
|||
[[!comment format=mdwn
|
||||
username="matrss"
|
||||
avatar="http://cdn.libravatar.org/avatar/cd1c0b3be1af288012e49197918395f0"
|
||||
subject="comment 6"
|
||||
date="2025-01-27T15:26:15Z"
|
||||
content="""
|
||||
> > If the PSK were fully contained in the remote string then a third-party getting hold of that string could pretend to be the server

> I agree this would be a problem, but how would a third-party get ahold of the string? Remote urls don't usually get stored in the git repository; perhaps you were thinking of some other way.

My thinking was that git remote URLs usually aren't sensitive information that inherently grant access to a repository, so a construct where the remote URL contains the credentials is just unexpected. A careless user might e.g. put it into a `type=git` special remote or treat it in some other way in which one wouldn't treat a password, without considering the implications. I am not aware of a way in which they could be leaked without user intervention, though.

Having separate credentials explicitly named as such just seems safer. But in the end this would be the responsibility of the one implementing the p2p transport, anyway.
"""]]