merge in doc changes from master

This commit is contained in:
Joey Hess 2025-01-29 18:57:25 -04:00
parent 9e4314de76
commit cbb6df35aa
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
38 changed files with 1136 additions and 27 deletions


@@ -8,17 +8,24 @@ Some commands like `git-annex find` use RawFilePath end-to-end.
But this conversion is not yet complete. This is a todo to keep track of the
status.
* The Abstract FilePath proposal (AFPP) has been implemented, and so a number of
libraries like unix and directory now have versions that operate on
OsPath. That could be used in git-annex eg for things like
getDirectoryContents, when built against those versions.
(But OsPath uses ShortByteString, while RawFilePath is ByteString, so
conversion still entails a copy; see the sketch below this list.)
* withFile remains to be converted, and is used in several important code
paths, including Annex.Journal and Annex.Link.
There is a RawFilePath version in the file-io library, but that is
not currently a git-annex dependency. (withFile is in base, and base is
unlikely to convert to AFPP soon)
* unix has modules that operate on RawFilePath but no OSPath versions yet.
See https://github.com/haskell/unix/issues/240
* filepath-1.4.100 implements support for OSPath. It is bundled with
ghc-9.6.1 and above. Will need to switch from filepath-bytestring to
this; to avoid a lot of ifdefs, probably only after git-annex no
longer supports building with older ghc versions. This will entail
replacing all the RawFilePath with OsPath, which should be pretty
mechanical, with only some wrapper functions in Utility.FileIO and
Utility.RawFilePath needing to be changed.
* Utility.FileIO is used for most withFile and openFile, but not yet for
readFile, writeFile, and appendFile on FilePaths.
Note that the FilePath versions do newline translation on windows,
which has to be handled when converting to the Utility.FileIO ones.
* System.Directory.OsPath is available with the OsPath build flag, but
not yet used, and would eliminate a lot of fromRawFilePaths.
Make Utility.SystemDirectory import it when built with OsPath,
and the remaining 6 hours of work will explain itself.
This has been started in the `ospath` branch.
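
For reference, here is a minimal sketch of what the POSIX side of such
RawFilePath/OsPath conversion helpers could look like (this is not the
actual git-annex code, and it assumes the internal OsString constructors
remain importable; a Windows build would need a different definition):

    import Data.ByteString (ByteString)
    import Data.ByteString.Short (toShort, fromShort)
    import System.OsPath (OsPath)
    import System.OsString.Internal.Types (OsString(..), PosixString(..))

    type RawFilePath = ByteString

    -- The ByteString <-> ShortByteString conversion copies the bytes,
    -- which is the extra cost noted in the list above.
    toOsPath :: RawFilePath -> OsPath
    toOsPath = OsString . PosixString . toShort

    fromOsPath :: OsPath -> RawFilePath
    fromOsPath (OsString (PosixString s)) = fromShort s
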
[[!tag confirmed]]


@@ -0,0 +1,70 @@
[[!comment format=mdwn
username="joey"
subject="""comment 10"""
date="2025-01-28T14:06:41Z"
content="""
Using metadata to store the inputs of computations, like I did in my example
above, seems like it would allow the metadata to be changed later, which
would change the output when a key gets recomputed. That feels surprising,
because metadata could be changed for any reason, without the intention
of affecting a compute special remote.
It might be possible for git-annex to pin down the current state of
metadata (or the whole git-annex branch) and provide the same input to the
computation when it's run again. (Unless `git-annex forget` has caused
that old branch state to be lost..) But it can't fully isolate the program
from all unpinned inputs without using some form of containerization,
which feels out of scope for git-annex.
Instead of using metadata, the input values could be stored in the
per-special-remote state of the generated key. Or the input values could be
encoded in the key itself, but then two computations that generate the same
output would have two different keys, rather than hashing to the same key.
Using a key with a regular hash backend also lets the user find out if the
computation turns out to not be reproducible later for whatever reason;
getting the file from the compute special remote will fail at hash
verification time. Something like a VURL key could still alternatively be
used in cases where reproducibility is not important.
To add a computed file, the interface would look close to the same,
but now the --value options are setting fields in the compute special
remote's state:

    git-annex addcomputed foo --to ffmpeg-cut \
        --input source=input.mov \
        --value starttime=15:00 \
        --value endtime=30:00

The values could be provided to the "git-annex-compute-" program with
environment variables.
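
To make that concrete, a hypothetical `git-annex-compute-ffmpeg-cut` could
then be a very small program. A sketch in Haskell, assuming the inputs
simply arrive as environment variables named after the `--input`/`--value`
fields (the variable names and where the output lands are assumptions, not
a settled interface):

    import System.Environment (getEnv)
    import System.Process (callProcess)

    main :: IO ()
    main = do
        source    <- getEnv "source"     -- path to the provisioned input file
        starttime <- getEnv "starttime"
        endtime   <- getEnv "endtime"
        -- produce the requested clip; how the result is handed back to
        -- git-annex is a detail of the interface still being designed
        callProcess "ffmpeg"
            ["-i", source, "-ss", starttime, "-to", endtime, "cut.mov"]
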
For `--input source=foo`, it could look up the git-annex key (or git sha1)
of that file, and store that in the state. So it would provide the compute
program with the same data every time. But it could *also* store the
filename. And that allows for a command like this:

    git-annex recompute foo --from ffmpeg-cut

Which, when the input.mov file has been changed, would re-run the
computation with the new content of the file, and stage a new version of
the computed file. It could even be used to recompute every file in a tree:

    git-annex recompute . --from ffmpeg-cut

Also, that command could let input values be adjusted later:

    git-annex recompute foo --from ffmpeg-cut --value starttime=14:50
    git commit -m 'include the introduction of the speaker in the clip'

It would also be good to have a command that examines a computed key
and displays the values and inputs. That could be `git-annex whereis`
or perhaps a dedicated command with more structured output:

    git-annex examinecompute foo --from ffmpeg-cut
    source=input.mov (annex key SHA256--xxxxxxxxx)
    starttime=15:00
    endtime=30:00

This all feels like it might allow for some useful workflows...
"""]]


@@ -0,0 +1,24 @@
[[!comment format=mdwn
username="joey"
subject="""Re: worktree provisioning"""
date="2025-01-28T14:08:29Z"
content="""
@m.risse in your example the "data.nc" file gets new content when it is
retrieved from the special remote after the source file has changed.
But if you already have the data.nc file present in a repository, it
does not get updated immediately when you update the source
"data.grib" file.
So, a drop and re-get of a file changes the version of the file you have
available. For that matter, if the old version has been stored on other
remotes, a get may retrieve either an old or a new version.
That is not intuitive and it makes me wonder if using a
special remote is really a good fit for what you're wanting to do.
In your "cdo" example, it's not clear to me if the new version of the
software generates an identical file to the old, or if it has a bug fix
that causes it to generate a significantly different output. If the two
outputs are significantly different then treating them as the same
git-annex key seems questionable to me.
"""]]


@@ -0,0 +1,29 @@
[[!comment format=mdwn
username="joey"
subject="""comment 12"""
date="2025-01-28T15:39:44Z"
content="""
My design so far does not fully support
"Request one key, receive many".
My `git-annex addcomputed` command doesn't handle the case where a
computation generates multiple output files. While the `git-annex-compute-`
command's interface could let it return several computed files, addcomputed
would only add one file, at the name that the user specifies. What is it
supposed to do if the computation generates more than one? Maybe it needs a
way to let a whole directory be populated with the files generated by a
computation. Or a way to specify multiple files to add.
And here's another problem:
Suppose I have one very expensive computation that generates files foo
and bar. And a second, less expensive computation, that also generates foo
(same content) as well as generating baz. Both computations are run on the
same compute special remote. Now if the user runs `git-annex get foo`,
they will be unhappy if it chooses to run the expensive computation,
rather than the less expensive computation.
Since the per-special-remote state for a key is used as the computation
input, only one input can be saved for foo's key. So it wouldn't really be
picking between two alternatives, it would just use whatever the current
state for that key is.
"""]]


@@ -0,0 +1,24 @@
[[!comment format=mdwn
username="matrss"
avatar="http://cdn.libravatar.org/avatar/cd1c0b3be1af288012e49197918395f0"
subject="comment 13"
date="2025-01-29T09:56:12Z"
content="""
> @m.risse in your example the \"data.nc\" file gets new content when retrieved from the special remote and the source file has changed.
True, that can happen, and the user was explicit in that they either don't care about it (non-checksum backend, URL in my PoC), or do care (checksum backend), in which case git-annex would fail the checksum verification.
> But if you already have data.nc file present in a repository, it does not get updated immediately when you update the source \"data.grib\" file.
>
> So, a drop and re-get of a file changes the version of the file you have available. For that matter, if the old version has been stored on other remotes, a get may retrieve either an old or a new version. That is not intuitive and it makes me wonder if using a special remote is really a good fit for what you're wanting to do
This I haven't entirely thought through. I'd say that if the key uses a non-checksum backend, then it can only be assumed, and is the user's responsibility, that the resulting file is functionally identical, even if not bit-by-bit identical. E.g. with netCDF, checksums can differ due to small details like chunking, but the data might be the same. With a checksum backend git-annex would just fail the next recompute, but the interactions with copies on other remotes could indeed get confusing.
> In your \"cdo\" example, it's not clear to me if the new version of the software generates an identical file to the old, or if it has a bug fix that causes it to generate a significantly different output. If the two outputs are significantly different then treating them as the same git-annex key seems questionable to me.
Again, there are two possible cases depending on whether the key uses a checksum or a non-checksum backend. With a checksum: if the new version produces the same output, everything is fine; if the new version produces different output, then git-annex would indicate this discrepancy on the next recompute and the user has to decide how to handle it (probably by checking that the output of the new version is either functionally the same or in some way \"better\" than the old one, and updating the repository to record this new key as that file).
Without a checksum backend the user would again have been explicit in that they don't care if the data changes for whatever reason; the key is essentially just a placeholder for the computation, without a guarantee about its content.
Something like VURL would be a compromise between the two: it would avoid the upfront cost of computing all files (which might be very expensive), but still instruct git-annex to error out if the checksum changes at some point after the first compute. A regular migration of the computed-files-so-far to a checksum backend could achieve the same.
"""]]


@@ -0,0 +1,11 @@
[[!comment format=mdwn
username="matrss"
avatar="http://cdn.libravatar.org/avatar/cd1c0b3be1af288012e49197918395f0"
subject="comment 14"
date="2025-01-29T10:13:59Z"
content="""
Some thoughts regarding your ideas:
- Multiple output files could always be emulated by generating a single archive file and registering additional compute instructions that simply extract each output file from that archive (see the sketch after this list). I think there could be some convenience functionality on the CLI side to set that up, and the key of the archive file might not even need to correspond to an actual file in the tree.
- For my use-cases (and I think DataLad at large) it is important to make this feature work across repository boundaries. E.g. I would like to use this feature to build a derived dataset from <https://atris.fz-juelich.de/MeteoCloud/ERA5>, where exactly this conversion from grib to netcdf happens in the compute step. I'd like to have the netcdf outputs as a separate dataset as some users might only be interested in the grib files, and it would scale better when there is more than just one kind of output that can be derived from an input by computation. `git annex get` doesn't work recursively across submodules/subdatasets though, and `datalad get` does not understand keys, just paths (at least so far).
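
To illustrate the first point, the per-file "compute instructions" could be
as small as a program that extracts one member from the archive. A sketch in
Haskell (the environment variable names are purely illustrative, not an
actual interface):

    import System.Environment (getEnv)
    import System.Process (callProcess)

    main :: IO ()
    main = do
        archive <- getEnv "archive"  -- path to the single archive's content
        member  <- getEnv "member"   -- which output file to extract from it
        callProcess "tar" ["-xf", archive, member]
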
"""]]


@@ -3,5 +3,5 @@
 subject="""comment 3"""
 date="2024-04-30T19:31:35Z"
 content="""
-See also [[todo/wishlist__58___derived_content_support]].
+See also [[todo/wishlist:_derived_content_support]].
 """]]


@@ -3,11 +3,30 @@
 subject="""comment 6"""
 date="2024-04-30T19:53:43Z"
 content="""
-On trust, it seems to me that if someone chooses to enable a particular
-special remote, they are choosing to trust whatever kind of computations it
-supports.
+On trust, it seems to me that if someone chooses to install a
+particular special remote, they are choosing to trust whatever kind of
+computations it supports.
 Eg a special remote could choose to always run a computation inside a
 particular container system and then if you trust that container system is
-secure, you can choose to use it.
+secure, you can choose to install it.
+Enabling the special remote is not necessary, because a
+repository can be set to autoenable a special remote. In some sense this is
+surprising. I had originally talked about enabling here and then I
+remembered autoenable.
+It may be that autoenable should only be allowed for
+special remote programs that the user explicitly whitelists, not just
+installs into PATH. That would break some existing workflows, though
+setting some git configs would not be too hard.
+There seems to be scope both for compute special remotes that execute code
+that comes from the git repository, and ones that only have metadata about
+the computation recorded in the git repository, in a way that cannot let
+them execute arbitrary code under the control of the git repository.
+A well-behaved compute special remote that does run code that comes from a
+git repository could require an additional git config to be set to allow it
+to do that.
 """]]


@@ -0,0 +1,75 @@
[[!comment format=mdwn
username="joey"
subject="""comment 9"""
date="2025-01-27T14:46:43Z"
content="""
Circling back to this, I think the fork in the road is whether this is
about git-annex providing this and that feature to support external special
remotes that compute, or whether git-annex gets a compute special
remote of its own with some simpler/better extension interface
than the external special remote protocol.
Of course, git-annex having its own compute special remote would not
preclude other external special remotes that compute. And for that matter,
a single external special remote could implement an extension interface.
---
Thinking about how a generic compute special remote in git-annex could
work, multiple instances of it could be initremoted:

    git-annex initremote convertfiles type=compute program=csv-to-xslx
    git-annex initremote cutvideo type=compute program=ffmpeg-cut

Here the "program" parameter would cause a program like
`git-annex-compute-ffmpeg-cut` to be run to get files from that instance
of the compute special remote. The interface could be as simple as it
being run with the key that it is requested to compute, and outputting
the paths to all the keys it was able to compute. (So allowing for
"request one key, receive many".) Perhaps also with some way to indicate
progress of the computation.
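
As an illustration only (the argument convention and the output format here
are assumptions, not a settled protocol), a program implementing that
minimal contract might be no more than:

    import System.Environment (getArgs)

    main :: IO ()
    main = do
        [requested] <- getArgs
        -- ... run the computation for the requested key here ...
        -- then report each key that was computed and where its content is
        let computed = [(requested, "/tmp/computed/output.mov")]
        mapM_ (\(key, path) -> putStrLn (key ++ "\t" ++ path)) computed
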
It would make sense to store the details of computations in git-annex
metadata. And a compute program can use git-annex commands to get files
it depends on. Eg, `git-annex-compute-ffmpeg-cut` could run:

    # look up the configured metadata
    starttime=$(git-annex metadata --get compute-ffmpeg-starttime --key=$requested)
    endtime=$(git-annex metadata --get compute-ffmpeg-endtime --key=$requested)
    source=$(git-annex metadata --get compute-ffmpeg-source --key=$requested)

    # get the source video file
    git-annex get --key=$source
    git-annex examinekey --format='${objectpath}' $source

It might be worth formalizing that a given computed key can depend on other
keys, and have git-annex always get/compute those keys first. And provide
them to the program in a worktree?
When asked to store a key in the compute special remote, it would verify
that the key can be generated by it. Using the same interface as used to
get a key.
This all leaves a chicken and egg problem: how does the user add a computed
file if they don't know the key yet?
The user could manually run the commands that generate the computed file,
then `git-annex add` it, and set the metadata. Then `git-annex copy --to`
the compute remote would verify if the file can be generated, and add it if
so. This seems awkward, but also nice to be able to do manually.
Or, something like VURL keys could be used, with an interface something
like this:

    git-annex addcomputed foo --to ffmpeg-cut \
        --input compute-ffmpeg-source=input.mov \
        --set compute-ffmpeg-starttime=15:00 \
        --set compute-ffmpeg-endtime=30:00

All that would do is generate some arbitrary VURL key or similar,
provisionally set the provided metadata (how?), and try to store the key
in the compute special remote. If it succeeds, stage an annex pointer
and commit the metadata. Since it's a VURL key, storing the key in the
compute special remote would also record the hash of the generated file
at that point.
"""]]


@@ -0,0 +1,14 @@
[[!comment format=mdwn
username="matrss"
avatar="http://cdn.libravatar.org/avatar/cd1c0b3be1af288012e49197918395f0"
subject="comment 6"
date="2025-01-27T15:26:15Z"
content="""
> > If the PSK were fully contained in the remote string then a third-party getting hold of that string could pretend to be the server
> I agree this would be a problem, but how would a third-party get ahold of the string though? Remote urls don't usually get stored in the git repository, perhaps you were thinking of some other way.
My thinking was that git remote URLs usually aren't sensitive information that inherently grant access to a repository, so a construct where the remote URL contains the credentials is just unexpected. A careless user might e.g. put it into a `type=git` special remote or treat it in some other way in which one wouldn't treat a password, without considering the implications. I am not aware of a way in which they could be leaked without user intervention, though.
Having separate credentials explicitly named as such just seems safer. But in the end this would be the responsibility of the one implementing the p2p transport, anyway.
"""]]