merge in doc changes from master

This commit is contained in:
Joey Hess 2025-01-29 18:57:25 -04:00
parent 9e4314de76
commit cbb6df35aa
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
38 changed files with 1136 additions and 27 deletions


@@ -8,17 +8,24 @@ Some commands like `git-annex find` use RawFilePath end-to-end.
But this conversion is not yet complete. This is a todo to keep track of the
status.
* The Abstract FilePath proposal (AFPP) has been implemented, and so a number of
libraries like unix and directory now have versions that operate on
OsPath. That could be used in git-annex eg for things like
getDirectoryContents, when built against those versions.
(But OsPath uses ShortByteString, while RawFilePath is ByteString, so
conversion still entails a copy; see the sketch below this list.)
* withFile remains to be converted, and is used in several important code
paths, including Annex.Journal and Annex.Link.
There is a RawFilePath version in the file-io library, but that is
not currently a git-annex dependency. (withFile is in base, and base is
unlikely to convert to AFPP soon)
* unix has modules that operate on RawFilePath but no OSPath versions yet.
See https://github.com/haskell/unix/issues/240
* filepath-1.4.100 implements support for OSPath. It is bundled with
ghc-9.6.1 and above. Will need to switch from filepath-bytestring to
this; to avoid a lot of ifdefs, probably only after git-annex no
longer supports building with older ghc versions. This will entail
replacing all the RawFilePath with OsPath, which should be pretty
mechanical, with only some wrapper functions in Utility.FileIO and
Utility.RawFilePath needing to be changed.
* Utility.FileIO is used for most withFile and openFile, but not yet for
readFile, writeFile, and appendFile on FilePaths.
Note that the FilePath versions do newline translation on windows,
which has to be handled when converting to the Utility.FileIO ones.
* System.Directory.OsPath is available with the OsPath build flag, but
not yet used, and would eliminate a lot of fromRawFilePaths.
Make Utility.SystemDirectory import it when built with OsPath,
and the remaining 6 hours of work will explain itself.
This has been started in the `ospath` branch.
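
For reference, here is a minimal sketch of what the POSIX side of such
RawFilePath/OsPath conversion helpers could look like (this is not the
actual git-annex code, and it assumes the internal OsString constructors
remain importable; a Windows build would need a different definition):

    import Data.ByteString (ByteString)
    import Data.ByteString.Short (toShort, fromShort)
    import System.OsPath (OsPath)
    import System.OsString.Internal.Types (OsString(..), PosixString(..))

    type RawFilePath = ByteString

    -- The ByteString <-> ShortByteString conversion copies the bytes,
    -- which is the extra cost noted in the list above.
    toOsPath :: RawFilePath -> OsPath
    toOsPath = OsString . PosixString . toShort

    fromOsPath :: OsPath -> RawFilePath
    fromOsPath (OsString (PosixString s)) = fromShort s
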
[[!tag confirmed]]


@@ -0,0 +1,70 @@
[[!comment format=mdwn
username="joey"
subject="""comment 10"""
date="2025-01-28T14:06:41Z"
content="""
Using metadata to store the inputs of computations, like I did in my example
above, seems like it would allow the metadata to be changed later, which
would change the output when a key gets recomputed. That feels surprising,
because metadata could be changed for any reason, without the intention
of affecting a compute special remote.
It might be possible for git-annex to pin down the current state of
metadata (or the whole git-annex branch) and provide the same input to the
computation when it's run again. (Unless `git-annex forget` has caused
that old branch state to be lost..) But it can't fully isolate the program
from all unpinned inputs without using some form of containerization,
which feels out of scope for git-annex.
Instead of using metadata, the input values could be stored in the
per-special-remote state of the generated key. Or the input values could be
encoded in the key itself, but then two computations that generate the same
output would have two different keys, rather than hashing to the same key.
Using a key with a regular hash backend also lets the user find out if the
computation turns out to not be reproducible later for whatever reason;
getting the file from the compute special remote will fail at hash
verification time. Something like a VURL key could still alternatively be
used in cases where reproducibility is not important.
To add a computed file, the interface would look close to the same,
but now the --value options are setting fields in the compute special
remote's state:

    git-annex addcomputed foo --to ffmpeg-cut \
        --input source=input.mov \
        --value starttime=15:00 \
        --value endtime=30:00

The values could be provided to the "git-annex-compute-" program with
environment variables.
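
To make that concrete, a hypothetical `git-annex-compute-ffmpeg-cut` could
then be a very small program. A sketch in Haskell, assuming the inputs
simply arrive as environment variables named after the `--input`/`--value`
fields (the variable names and where the output lands are assumptions, not
a settled interface):

    import System.Environment (getEnv)
    import System.Process (callProcess)

    main :: IO ()
    main = do
        source    <- getEnv "source"     -- path to the provisioned input file
        starttime <- getEnv "starttime"
        endtime   <- getEnv "endtime"
        -- produce the requested clip; how the result is handed back to
        -- git-annex is a detail of the interface still being designed
        callProcess "ffmpeg"
            ["-i", source, "-ss", starttime, "-to", endtime, "cut.mov"]
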
For `--input source=foo`, it could look up the git-annex key (or git sha1)
of that file, and store that in the state. So it would provide the compute
program with the same data every time. But it could *also* store the
filename. And that allows for a command like this:

    git-annex recompute foo --from ffmpeg-cut

Which, when the input.mov file has been changed, would re-run the
computation with the new content of the file, and stage a new version of
the computed file. It could even be used to recompute every file in a tree:

    git-annex recompute . --from ffmpeg-cut

Also, that command could let input values be adjusted later:

    git-annex recompute foo --from ffmpeg-cut --value starttime=14:50
    git commit -m 'include the introduction of the speaker in the clip'

It would also be good to have a command that examines a computed key
and displays the values and inputs. That could be `git-annex whereis`
or perhaps a dedicated command with more structured output:

    git-annex examinecompute foo --from ffmpeg-cut
    source=input.mov (annex key SHA256--xxxxxxxxx)
    starttime=15:00
    endtime=30:00

This all feels like it might allow for some useful workflows...
"""]]


@@ -0,0 +1,24 @@
[[!comment format=mdwn
username="joey"
subject="""Re: worktree provisioning"""
date="2025-01-28T14:08:29Z"
content="""
@m.risse in your example the "data.nc" file gets new content when it is
retrieved from the special remote after the source file has changed.
But if you already have the data.nc file present in a repository, it
does not get updated immediately when you update the source
"data.grib" file.
So, a drop and re-get of a file changes the version of the file you have
available. For that matter, if the old version has been stored on other
remotes, a get may retrieve either an old or a new version.
That is not intuitive and it makes me wonder if using a
special remote is really a good fit for what you're wanting to do.
In your "cdo" example, it's not clear to me if the new version of the
software generates an identical file to the old, or if it has a bug fix
that causes it to generate a significantly different output. If the two
outputs are significantly different then treating them as the same
git-annex key seems questionable to me.
"""]]


@@ -0,0 +1,29 @@
[[!comment format=mdwn
username="joey"
subject="""comment 12"""
date="2025-01-28T15:39:44Z"
content="""
My design so far does not fully support
"Request one key, receive many".
My `git-annex addcomputed` command doesn't handle the case where a
computation generates multiple output files. While the `git-annex-compute-`
command's interface could let it return several computed files, addcomputed
would only add one file, at the name that the user specifies. What is it
supposed to do if the computation generates more than one? Maybe it needs a
way to let a whole directory be populated with the files generated by a
computation. Or a way to specify multiple files to add.
And here's another problem:
Suppose I have one very expensive computation that generates files foo
and bar. And a second, less expensive computation, that also generates foo
(same content) as well as generating baz. Both computations are run on the
same compute special remote. Now if the user runs `git-annex get foo`,
they will be unhappy if it chooses to run the expensive computation,
rather than the less expensive computation.
Since the per-special-remote state for a key is used as the computation
input, only one input can be saved for foo's key. So it wouldn't really be
picking between two alternatives, it would just use whatever the current
state for that key is.
"""]]


@@ -0,0 +1,24 @@
[[!comment format=mdwn
username="matrss"
avatar="http://cdn.libravatar.org/avatar/cd1c0b3be1af288012e49197918395f0"
subject="comment 13"
date="2025-01-29T09:56:12Z"
content="""
> @m.risse in your example the \"data.nc\" file gets new content when retrieved from the special remote and the source file has changed.
True, that can happen, and the user was explicit in that they either don't care about it (non-checksum backend, URL in my PoC), or do care (checksum backend), in which case git-annex would fail the checksum verification.
> But if you already have data.nc file present in a repository, it does not get updated immediately when you update the source \"data.grib\" file.
>
> So, a drop and re-get of a file changes the version of the file you have available. For that matter, if the old version has been stored on other remotes, a get may retrieve either an old or a new version. That is not intuitive and it makes me wonder if using a special remote is really a good fit for what you're wanting to do
This I haven't entirely thought through. I'd say that if the key uses a non-checksum backend, then it can only be assumed, and is the user's responsibility, that the resulting file is functionally identical, even if not bit-by-bit identical. E.g. with netCDF, checksums can differ due to small details like chunking, but the data might be the same. With a checksum backend git-annex would just fail the next recompute, but the interactions with copies on other remotes could indeed get confusing.
> In your \"cdo\" example, it's not clear to me if the new version of the software generates an identical file to the old, or if it has a bug fix that causes it to generate a significantly different output. If the two outputs are significantly different then treating them as the same git-annex key seems questionable to me.
Again, there are two possible cases depending on whether the key uses a checksum or a non-checksum backend. With a checksum: if the new version produces the same output, everything is fine; if the new version produces different output, then git-annex would indicate this discrepancy on the next recompute and the user has to decide how to handle it (probably by checking that the output of the new version is either functionally the same or in some way \"better\" than the old one, and updating the repository to record this new key as that file).
Without a checksum backend the user would again have been explicit in that they don't care if the data changes for whatever reason; the key is essentially just a placeholder for the computation, without a guarantee about its content.
Something like VURL would be a compromise between the two: it would avoid the upfront cost of computing all files (which might be very expensive), but still instruct git-annex to error out if the checksum changes at some point after the first compute. A regular migration of the computed-files-so-far to a checksum backend could achieve the same.
"""]]


@@ -0,0 +1,11 @@
[[!comment format=mdwn
username="matrss"
avatar="http://cdn.libravatar.org/avatar/cd1c0b3be1af288012e49197918395f0"
subject="comment 14"
date="2025-01-29T10:13:59Z"
content="""
Some thoughts regarding your ideas:
- Multiple output files could always be emulated by generating a single archive file and registering additional compute instructions that simply extract each output file from that archive (see the sketch after this list). I think there could be some convenience functionality on the CLI side to set that up, and the key of the archive file might not even need to correspond to an actual file in the tree.
- For my use-cases (and I think DataLad at large) it is important to make this feature work across repository boundaries. E.g. I would like to use this feature to build a derived dataset from <https://atris.fz-juelich.de/MeteoCloud/ERA5>, where exactly this conversion from grib to netcdf happens in the compute step. I'd like to have the netcdf outputs as a separate dataset as some users might only be interested in the grib files, and it would scale better when there is more than just one kind of output that can be derived from an input by computation. `git annex get` doesn't work recursively across submodules/subdatasets though, and `datalad get` does not understand keys, just paths (at least so far).
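
To illustrate the first point, the per-file "compute instructions" could be
as small as a program that extracts one member from the archive. A sketch in
Haskell (the environment variable names are purely illustrative, not an
actual interface):

    import System.Environment (getEnv)
    import System.Process (callProcess)

    main :: IO ()
    main = do
        archive <- getEnv "archive"  -- path to the single archive's content
        member  <- getEnv "member"   -- which output file to extract from it
        callProcess "tar" ["-xf", archive, member]
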
"""]]


@@ -3,5 +3,5 @@
 subject="""comment 3"""
 date="2024-04-30T19:31:35Z"
 content="""
-See also [[todo/wishlist__58___derived_content_support]].
+See also [[todo/wishlist:_derived_content_support]].
 """]]


@@ -3,11 +3,30 @@
 subject="""comment 6"""
 date="2024-04-30T19:53:43Z"
 content="""
-On trust, it seems to me that if someone chooses to enable a particular
-special remote, they are choosing to trust whatever kind of computations it
-supports.
+On trust, it seems to me that if someone chooses to install a
+particular special remote, they are choosing to trust whatever kind of
+computations it supports.
 Eg a special remote could choose to always run a computation inside a
 particular container system and then if you trust that container system is
-secure, you can choose to use it.
+secure, you can choose to install it.
+Enabling the special remote is not necessary, because a
+repository can be set to autoenable a special remote. In some sense this is
+surprising. I had originally talked about enabling here and then I
+remembered autoenable.
+It may be that autoenable should only be allowed for
+special remote programs that the user explicitly whitelists, not just
+installs into PATH. That would break some existing workflows, though
+setting some git configs would not be too hard.
+There seems to be scope both for compute special remotes that execute code
+that comes from the git repository, and ones that only have metadata about
+the computation recorded in the git repository, in a way that cannot let
+them execute arbitrary code under the control of the git repository.
+A well-behaved compute special remote that does run code that comes from a
+git repository could require an additional git config to be set to allow it
+to do that.
 """]]


@@ -0,0 +1,75 @@
[[!comment format=mdwn
username="joey"
subject="""comment 9"""
date="2025-01-27T14:46:43Z"
content="""
Circling back to this, I think the fork in the road is whether this is
about git-annex providing this and that feature to support external special
remotes that compute, or whether git-annex gets a compute special
remote of its own with some simpler/better extension interface
than the external special remote protocol.
Of course, git-annex having its own compute special remote would not
preclude other external special remotes that compute. And for that matter,
a single external special remote could implement an extension interface.
---
Thinking about how a generic compute special remote in git-annex could
work, multiple instances of it could be initremoted:

    git-annex initremote convertfiles type=compute program=csv-to-xslx
    git-annex initremote cutvideo type=compute program=ffmpeg-cut

Here the "program" parameter would cause a program like
`git-annex-compute-ffmpeg-cut` to be run to get files from that instance
of the compute special remote. The interface could be as simple as it
being run with the key that it is requested to compute, and outputting
the paths to all the keys it was able to compute. (So allowing for
"request one key, receive many".) Perhaps also with some way to indicate
progress of the computation.
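
As an illustration only (the argument convention and the output format here
are assumptions, not a settled protocol), a program implementing that
minimal contract might be no more than:

    import System.Environment (getArgs)

    main :: IO ()
    main = do
        [requested] <- getArgs
        -- ... run the computation for the requested key here ...
        -- then report each key that was computed and where its content is
        let computed = [(requested, "/tmp/computed/output.mov")]
        mapM_ (\(key, path) -> putStrLn (key ++ "\t" ++ path)) computed
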
It would make sense to store the details of computations in git-annex
metadata. And a compute program can use git-annex commands to get files
it depends on. Eg, `git-annex-compute-ffmpeg-cut` could run:

    # look up the configured metadata
    starttime=$(git-annex metadata --get compute-ffmpeg-starttime --key=$requested)
    endtime=$(git-annex metadata --get compute-ffmpeg-endtime --key=$requested)
    source=$(git-annex metadata --get compute-ffmpeg-source --key=$requested)

    # get the source video file
    git-annex get --key=$source
    git-annex examinekey --format='${objectpath}' $source

It might be worth formalizing that a given computed key can depend on other
keys, and have git-annex always get/compute those keys first. And provide
them to the program in a worktree?
When asked to store a key in the compute special remote, it would verify
that the key can be generated by it. Using the same interface as used to
get a key.
This all leaves a chicken and egg problem: how does the user add a computed
file if they don't know the key yet?
The user could manually run the commands that generate the computed file,
then `git-annex add` it, and set the metadata. Then `git-annex copy --to`
the compute remote would verify if the file can be generated, and add it if
so. This seems awkward, but also nice to be able to do manually.
Or, something like VURL keys could be used, with an interface something
like this:

    git-annex addcomputed foo --to ffmpeg-cut \
        --input compute-ffmpeg-source=input.mov \
        --set compute-ffmpeg-starttime=15:00 \
        --set compute-ffmpeg-endtime=30:00

All that would do is generate some arbitrary VURL key or similar,
provisionally set the provided metadata (how?), and try to store the key
in the compute special remote. If it succeeds, stage an annex pointer
and commit the metadata. Since it's a VURL key, storing the key in the
compute special remote would also record the hash of the generated file
at that point.
"""]]


@@ -0,0 +1,14 @@
[[!comment format=mdwn
username="matrss"
avatar="http://cdn.libravatar.org/avatar/cd1c0b3be1af288012e49197918395f0"
subject="comment 6"
date="2025-01-27T15:26:15Z"
content="""
> > If the PSK were fully contained in the remote string then a third-party getting hold of that string could pretend to be the server
> I agree this would be a problem, but how would a third-party get ahold of the string though? Remote urls don't usually get stored in the git repository, perhaps you were thinking of some other way.
My thinking was that git remote URLs usually aren't sensitive information that inherently grant access to a repository, so a construct where the remote URL contains the credentials is just unexpected. A careless user might e.g. put it into a `type=git` special remote or treat it in some other way in which one wouldn't treat a password, without considering the implications. I am not aware of a way in which they could be leaked without user intervention, though.
Having separate credentials explicitly named as such just seems safer. But in the end this would be the responsibility of the one implementing the p2p transport, anyway.
"""]]