Merge branch 'master' into ospath

Joey Hess 2025-01-29 14:24:35 -04:00
commit 76bff3e8d1
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
11 changed files with 519 additions and 7 deletions

@@ -0,0 +1,24 @@
[[!comment format=mdwn
username="matrss"
avatar="http://cdn.libravatar.org/avatar/cd1c0b3be1af288012e49197918395f0"
subject="comment 13"
date="2025-01-29T09:56:12Z"
content="""
> @m.risse in your example the \"data.nc\" file gets new content when retrieved from the special remote and the source file has changed.

True, that can happen, and the user has been explicit that they either don't care about it (non-checksum backend, URL in my PoC) or do care (checksum backend), in which case git-annex would fail the checksum verification.
> But if you already have data.nc file present in a repository, it does not get updated immediately when you update the source \"data.grib\" file.
>
> So, a drop and re-get of a file changes the version of the file you have available. For that matter, if the old version has been stored on other remotes, a get may retrieve either an old or a new version. That is not intuitive and it makes me wonder if using a special remote is really a good fit for what you're wanting to do

This I haven't entirely thought through. I'd say that if the key uses a non-checksum backend, then it can only be assumed, and is the user's responsibility, that the resulting file is functionally identical, even if not bit-by-bit identical. E.g. with netCDF, checksums can differ due to small details like chunking while the data is the same. With a checksum backend git-annex would just fail the next recompute, but the interactions with copies on other remotes could indeed get confusing.
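As a side note, a minimal sketch of what checking for functional equivalence could look like in practice (assuming cdo is installed; the paths are just placeholders):

    # byte-level checksums differ, e.g. because of different chunking settings
    sha256sum old/data.nc new/data.nc
    # but a field-by-field comparison of the actual data reports no differences
    cdo diffn old/data.nc new/data.nc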
> In your \"cdo\" example, it's not clear to me if the new version of the software generates an identical file to the old, or if it has a bug fix that causes it to generate a significantly different output. If the two outputs are significantly different then treating them as the same git-annex key seems questionable to me.
Again, two possible cases depending on if the key uses a checksum or a non-checksum backend. With a checksum: if the new version produces the same output everything is fine; if the new version produces different output then git-annex would indicate this discrepancy on the next recompute and the user has to decide how to handle it (probably by checking that the output of the new version is either functionally the same or in some way \"better\" than the old one and updating the repository to record this new key as that file).
Without a checksum backend the user would again have been explicit in that they don't care if the data changes for whatever reason, the key is essentially just a placeholder for the computation without a guarantee about its content.
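For the checksum-backend case, recording the new key could look roughly like this (the scratch path is made up, and this assumes the file is currently present and locked):

    git annex unlock data.nc              # make the work tree copy writable
    cp /tmp/recompute/data.nc data.nc     # hypothetical location of the new output
    git annex add data.nc                 # records the new key for this path
    git commit -m 'accept output of new cdo version'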
Something like VURL would be a compromise between the two: it would avoid the upfront cost of computing all files (which might be very expensive), but still instruct git-annex to error out if the checksum changes at some point after the first compute. A periodic migration of the files computed so far to a checksum backend could achieve the same.
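To illustrate that last point, a rough sketch of such a migration (assuming the computed files are present locally; SHA256E just stands in for any checksum backend):

    # pin down the content of everything computed so far
    git annex migrate --backend=SHA256E path/to/computed/
    git commit -m 'migrate computed files to a checksum backend'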
"""]]

@@ -0,0 +1,11 @@
[[!comment format=mdwn
username="matrss"
avatar="http://cdn.libravatar.org/avatar/cd1c0b3be1af288012e49197918395f0"
subject="comment 14"
date="2025-01-29T10:13:59Z"
content="""
Some thoughts regarding your ideas:

- Multiple output files could always be emulated by generating a single archive file and registering additional compute instructions that simply extract each output file from that archive (see the sketch after this list). I think there could be some convenience functionality on the CLI side to set that up, and the key of the archive file might not even need to correspond to an actual file in the tree.
- For my use-cases (and I think for DataLad at large) it is important that this feature works across repository boundaries. E.g. I would like to use it to build a derived dataset from <https://atris.fz-juelich.de/MeteoCloud/ERA5>, where exactly this conversion from GRIB to netCDF happens in the compute step. I'd like to have the netCDF outputs as a separate dataset, since some users might only be interested in the GRIB files, and it would scale better when there is more than one kind of output that can be derived from an input by computation. However, `git annex get` doesn't work recursively across submodules/subdatasets, and `datalad get` does not understand keys, only paths (at least so far).
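To illustrate the archive-based emulation from the first point, a purely hypothetical sketch (the script and file names are made up, and the compute instructions are shown as plain shell commands rather than actual git-annex syntax):

    # one expensive compute step bundles every derived file into a single archive key
    ./grib-to-netcdf-all.sh ERA5.grib outputs.tar    # hypothetical conversion script
    # each output file in the tree then only needs a cheap extraction instruction
    tar -xf outputs.tar t2m.nc
    tar -xf outputs.tar precip.nc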
"""]]