Merge /home/joey/tmp/git-annex into ospath
commit 917c43f31f
19 changed files with 509 additions and 16 deletions
[[!comment format=mdwn
username="joey"
subject="""comment 10"""
date="2025-01-28T14:06:41Z"
content="""
Using metadata to store the inputs of computations like I did in my example
above seems like it would allow the metadata to be changed later, which
would change the output when a key gets recomputed. That feels surprising,
because metadata could be changed for any reason, without the intention
of affecting a compute special remote.

It might be possible for git-annex to pin down the current state of
metadata (or the whole git-annex branch) and provide the same input to the
computation when it's run again. (Unless `git-annex forget` has caused
that old branch state to be lost..) But it can't fully isolate the program
from all unpinned inputs without using some form of containerization,
which feels out of scope for git-annex.

Instead of using metadata, the input values could be stored in the
per-special-remote state of the generated key. Or the input values could be
encoded in the key itself, but then two computations that generate the same
output would have two different keys, rather than hashing to the same key.

Using a key with a regular hash backend also lets the user find out if the
computation turns out not to be reproducible later for whatever reason;
getting the file from the compute special remote will fail at hash
verification time. Something like a VURL key could still alternatively be
used in cases where reproducibility is not important.

To add a computed file, the interface would look close to the same,
but now the --value options are setting fields in the compute special
remote's state:

    git-annex addcomputed foo --to ffmpeg-cut \
        --input source=input.mov \
        --value starttime=15:00 \
        --value endtime=30:00

The values could be provided to the "git-annex-compute-" program with
environment variables.
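
As a rough sketch of what that could look like (the `ANNEX_COMPUTE_*` variable names and the output path convention are assumptions, not a settled interface), a `git-annex-compute-ffmpeg-cut` program might be:

```shell
#!/bin/sh
# Hypothetical git-annex-compute-ffmpeg-cut. All variable names here
# are assumptions about an interface that does not exist yet.
set -eu

# Inputs and values, as git-annex would set them from the stored state.
source="${ANNEX_COMPUTE_source:-input.mov}"
starttime="${ANNEX_COMPUTE_starttime:-15:00}"
endtime="${ANNEX_COMPUTE_endtime:-30:00}"
output="${ANNEX_COMPUTE_output:-output.mov}"

# Only print the ffmpeg invocation, since this is an illustration;
# a real program would execute it.
echo ffmpeg -i "$source" -ss "$starttime" -to "$endtime" -c copy "$output"
```

The point of the environment-variable approach is that the program needs no option parsing at all, and git-annex stays in control of exactly which state reaches the computation.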

For `--input source=foo`, it could look up the git-annex key (or git sha1)
of that file, and store that in the state. So it would provide the compute
program with the same data every time. But it could *also* store the
filename. And that allows for a command like this:

    git-annex recompute foo --from ffmpeg-cut

Which, when the input.mov file has been changed, would re-run the
computation with the new content of the file, and stage a new version of
the computed file. It could even be used to recompute every file in a tree:

    git-annex recompute . --from ffmpeg-cut

Also, that command could let input values be adjusted later:

    git-annex recompute foo --from ffmpeg-cut --value starttime=14:50
    git commit -m 'include the introduction of the speaker in the clip'

It would also be good to have a command that examines a computed key
and displays the values and inputs. That could be `git-annex whereis`
or perhaps a dedicated command with more structured output:

    git-annex examinecompute foo --from ffmpeg-cut
    source=input.mov (annex key SHA256--xxxxxxxxx)
    starttime=15:00
    endtime=30:00

This all feels like it might allow for some useful workflows...
"""]]

[[!comment format=mdwn
username="joey"
subject="""Re: worktree provisioning"""
date="2025-01-28T14:08:29Z"
content="""
@m.risse in your example the "data.nc" file gets new content when
retrieved from the special remote and the source file has changed.

But if you already have the data.nc file present in a repository, it
does not get updated immediately when you update the source
"data.grib" file.

So, a drop and re-get of a file changes the version of the file you have
available. For that matter, if the old version has been stored on other
remotes, a get may retrieve either an old or a new version.
That is not intuitive, and it makes me wonder if using a
special remote is really a good fit for what you're wanting to do.

In your "cdo" example, it's not clear to me if the new version of the
software generates an identical file to the old, or if it has a bug fix
that causes it to generate a significantly different output. If the two
outputs are significantly different, then treating them as the same
git-annex key seems questionable to me.
"""]]

[[!comment format=mdwn
username="joey"
subject="""comment 12"""
date="2025-01-28T15:39:44Z"
content="""
My design so far does not fully support
"Request one key, receive many".

My `git-annex addcomputed` command doesn't handle the case where a
computation generates multiple output files. While the `git-annex-compute-`
command's interface could let it return several computed files, addcomputed
would only add one file, at the name that the user specifies. What is it
supposed to do if the computation generates more than one? Maybe it needs a
way to let a whole directory be populated with the files generated by a
computation. Or a way to specify multiple files to add.

And here's another problem:
Suppose I have one very expensive computation that generates files foo
and bar. And a second, less expensive computation, that also generates foo
(same content) as well as generating baz. Both computations are run on the
same compute special remote. Now if the user runs `git-annex get foo`,
they will be unhappy if it chooses to run the expensive computation,
rather than the less expensive computation.

Since the per-special-remote state for a key is used as the computation
input, only one input can be saved for foo's key. So it wouldn't really be
picking between two alternatives, it would just use whatever the current
state for that key is.
"""]]

subject="""comment 3"""
date="2024-04-30T19:31:35Z"
content="""
See also [[todo/wishlist:_derived_content_support]].
"""]]

subject="""comment 6"""
date="2024-04-30T19:53:43Z"
content="""
On trust, it seems to me that if someone chooses to install a
particular special remote, they are choosing to trust whatever kind of
computations it supports.

Eg a special remote could choose to always run a computation inside a
particular container system and then if you trust that container system is
secure, you can choose to install it.

Enabling the special remote is not necessary, because a
repository can be set to autoenable a special remote. In some sense this is
surprising. I had originally talked about enabling here, and then I
remembered autoenable.

It may be that autoenable should only be allowed for
special remote programs that the user explicitly whitelists, not merely
installs into PATH. That would break some existing workflows, though
setting some git configs would not be too hard.
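
For example (the config key name below is purely hypothetical, not an existing git-annex setting), such a whitelist could be expressed as ordinary git config:

```shell
# Hypothetical whitelist; the key name is an assumption, not a real
# git-annex config. Using --file keeps this sketch self-contained,
# where a real setup would use the global or repo config.
git config --file annex-whitelist.cfg --add annex.security.allowed-compute-programs ffmpeg-cut
git config --file annex-whitelist.cfg --add annex.security.allowed-compute-programs csv-to-xslx

# git-annex would then consult the list before autoenabling a remote:
git config --file annex-whitelist.cfg --get-all annex.security.allowed-compute-programs
```

A multi-valued key like this would let each allowed program be added with one `git config --add`, which keeps the "setting some git configs" burden small.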

There seems to be scope for both compute special remotes that execute code
that comes from the git repository, and ones that only have metadata about
the computation recorded in the git repository, in a way that cannot let
them execute arbitrary code under the control of the git repository.

A well-behaved compute special remote that does run code that comes from a
git repository could require an additional git config to be set to allow it
to do that.
"""]]

[[!comment format=mdwn
username="joey"
subject="""comment 9"""
date="2025-01-27T14:46:43Z"
content="""
Circling back to this, I think the fork in the road is whether this is
about git-annex providing this and that feature to support external special
remotes that compute, or whether git-annex gets a compute special
remote of its own with some simpler/better extension interface
than the external special remote protocol.

Of course, git-annex having its own compute special remote would not
preclude other external special remotes that compute. And for that matter,
a single external special remote could implement an extension interface.

---

Thinking about how a generic compute special remote in git-annex could
work, multiple instances of it could be initremoted:

    git-annex initremote convertfiles type=compute program=csv-to-xslx
    git-annex initremote cutvideo type=compute program=ffmpeg-cut

Here the "program" parameter would cause a program like
`git-annex-compute-ffmpeg-cut` to be run to get files from that instance
of the compute special remote. The interface could be as simple as it
being run with the key that it is requested to compute, and outputting
the paths to all the keys it was able to compute. (So allowing for
"request one key, receive many".) Perhaps also with some way to indicate
progress of the computation.
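
A minimal sketch of that interface (the `<key> <path>` output format and the scratch-directory handling are assumptions, not a real protocol): the program receives the requested key as its argument and prints one line per key it produced, so a single run can satisfy several keys.

```shell
#!/bin/sh
# Hypothetical git-annex-compute-<name> interface sketch; the
# "<key> <path>" output format is an assumption, not a real protocol.
set -eu

requested="${1:-KEY--example}"

# Run the computation into a scratch directory. A placeholder stands
# in for the real work (e.g. running ffmpeg).
tmpdir=$(mktemp -d)
printf 'computed content\n' > "$tmpdir/result"

# Report each computed key and the path to its content. A computation
# that yields several keys would print several such lines, giving
# "request one key, receive many".
echo "$requested $tmpdir/result"
```

git-annex would read those lines, verify and ingest each reported file, and could then treat any extra keys as already present rather than computing them again.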

It would make sense to store the details of computations in git-annex
metadata. And a compute program can use git-annex commands to get files
it depends on. Eg, `git-annex-compute-ffmpeg-cut` could run:

    # look up the configured metadata
    starttime=$(git-annex metadata --get compute-ffmpeg-starttime --key=$requested)
    endtime=$(git-annex metadata --get compute-ffmpeg-endtime --key=$requested)
    source=$(git-annex metadata --get compute-ffmpeg-source --key=$requested)

    # get the source video file, and find its path in the annex
    git-annex get --key=$source
    sourcefile=$(git-annex examinekey --format='${objectpath}' $source)

It might be worth formalizing that a given computed key can depend on other
keys, and have git-annex always get/compute those keys first. And provide
them to the program in a worktree?

When asked to store a key in the compute special remote, it would verify
that the key can be generated by it, using the same interface as is used
to get a key.

This all leaves a chicken and egg problem: how does the user add a computed
file if they don't know the key yet?

The user could manually run the commands that generate the computed file,
then `git-annex add` it, and set the metadata. Then `git-annex copy --to`
the compute remote would verify if the file can be generated, and add it if
so. This seems awkward, but also nice to be able to do manually.

Or, something like VURL keys could be used, with an interface something
like this:

    git-annex addcomputed foo --to ffmpeg-cut \
        --input compute-ffmpeg-source=input.mov \
        --set compute-ffmpeg-starttime=15:00 \
        --set compute-ffmpeg-endtime=30:00

All that would do is generate some arbitrary VURL key or similar,
provisionally set the provided metadata (how?), and try to store the key
in the compute special remote. If it succeeds, stage an annex pointer
and commit the metadata. Since it's a VURL key, storing the key in the
compute special remote would also record the hash of the generated file
at that point.
"""]]
@ -0,0 +1,14 @@
|
|||
[[!comment format=mdwn
|
||||
username="matrss"
|
||||
avatar="http://cdn.libravatar.org/avatar/cd1c0b3be1af288012e49197918395f0"
|
||||
subject="comment 6"
|
||||
date="2025-01-27T15:26:15Z"
|
||||
content="""
|
||||
> > If the PSK were fully contained in the remote string then a third-party getting hold of that string could pretend to be the server

> I agree this would be a problem, but how would a third-party get ahold of the string? Remote urls don't usually get stored in the git repository; perhaps you were thinking of some other way.

My thinking was that git remote URLs usually aren't sensitive information that inherently grant access to a repository, so a construct where the remote URL contains the credentials is just unexpected. A careless user might e.g. put it into a `type=git` special remote or treat it in some other way in which one wouldn't treat a password, without considering the implications. I am not aware of a way in which they could be leaked without user intervention, though.

Having separate credentials explicitly named as such just seems safer. But in the end this would be the responsibility of the one implementing the p2p transport, anyway.
"""]]