Merge /home/joey/tmp/git-annex into ospath

This commit is contained in:
Joey Hess 2025-01-28 15:29:58 -04:00
commit 917c43f31f
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
19 changed files with 509 additions and 16 deletions


@@ -203,7 +203,7 @@ splitKeyNameExtension' keyname = S8.span (/= '.') keyname
 {- A filename may be associated with a Key. -}
 newtype AssociatedFile = AssociatedFile (Maybe RawFilePath)
-	deriving (Show, Read, Eq, Ord)
+	deriving (Show, Eq, Ord)
 {- There are several different varieties of keys. -}
 data KeyVariety


@@ -0,0 +1,65 @@
### Please describe the problem.
I have in my mailbox
```
80 T Jan 26 GitHub Actions *-3.6* (3.7K/0) datalad/git-annex daily summary: 4 FAILED, 8 INCOMPLETE, 1 PASSED, 3 ABSENT
206 N T Jan 25 GitHub Actions *-3.8* (3.7K/0) datalad/git-annex daily summary: 4 FAILED, 8 INCOMPLETE, 1 PASSED, 3 ABSENT
357 T Jan 24 GitHub Actions *-4.4* (6.3K/0) datalad/git-annex daily summary: 12 FAILED, 8 INCOMPLETE, 1 PASSED, 3 ABSENT
1279 T Jan 23 GitHub Actions *-4.5* (3.7K/0) datalad/git-annex daily summary: 5 FAILED, 8 INCOMPLETE, 3 ABSENT
1715 T Jan 22 GitHub Actions *-5.0* (3.7K/0) datalad/git-annex daily summary: 5 FAILED, 8 INCOMPLETE, 3 ABSENT
2335 T Jan 21 GitHub Actions *-3.9* (3.7K/0) datalad/git-annex daily summary: 5 FAILED, 8 INCOMPLETE, 3 ABSENT
2656 T Jan 20 GitHub Actions *-4.3* (6.8K/0) datalad/git-annex daily summary: 28 PASSED, 2 ABSENT
2862 T Jan 19 GitHub Actions *-5.0* (6.8K/0) datalad/git-annex daily summary: 28 PASSED, 2 ABSENT
```
and looking at the [latest ubuntu build logs](https://github.com/datalad/git-annex/actions/runs/12970824274/job/36176536041) I see
```
I: the tail of the log
Build/LinuxMkLibs.hs:101:17: error:
Variable not in scope:
createDirectoryIfMissing :: Bool -> [Char] -> IO a3
|
101 | createDirectoryIfMissing True (top ++ libdir </> takeDirectory d)
| ^^^^^^^^^^^^^^^^^^^^^^^^
Build/LinuxMkLibs.hs:149:9: error:
Variable not in scope:
createDirectoryIfMissing :: Bool -> FilePath -> IO a2
|
149 | createDirectoryIfMissing True (top </> shimdir)
| ^^^^^^^^^^^^^^^^^^^^^^^^
Build/LinuxMkLibs.hs:150:9: error:
Variable not in scope:
createDirectoryIfMissing :: Bool -> FilePath -> IO a1
|
150 | createDirectoryIfMissing True (top </> exedir)
| ^^^^^^^^^^^^^^^^^^^^^^^^
Build/LinuxMkLibs.hs:160:19: error:
* Variable not in scope:
renameFile :: FilePath -> FilePath -> IO ()
* Perhaps you meant `readFile' (imported from Prelude)
|
160 | , renameFile exe exedest
| ^^^^^^^^^^
Build/LinuxMkLibs.hs:165:18: error:
Variable not in scope: doesFileExist :: FilePath -> IO Bool
|
165 | unlessM (doesFileExist (top </> exelink)) $
| ^^^^^^^^^^^^^
Build/LinuxMkLibs.hs:181:9: error:
Variable not in scope:
createDirectoryIfMissing :: Bool -> FilePath -> IO a0
|
181 | createDirectoryIfMissing True destdir
| ^^^^^^^^^^^^^^^^^^^^^^^^
make[3]: *** [Makefile:156: Build/Standalone] Error 1
make[3]: Leaving directory '/home/runner/work/git-annex/git-annex/git-annex-source'
make[2]: *** [Makefile:164: linuxstandalone] Error 2
```


@@ -4,13 +4,13 @@ I have a pretty big repository with around 300 000 files in the workdir of a bra
 I wanted to unlock all those files from that branch on a machine, so I tried to use git-annex-adjust --unlock.
 Sadly, the command do not seems to finish, ever.
-Executing the command with debug from a clone(to avoid interacting with the broken index from the first), it seems to deadlock after executing between 10000 and 20000 "thawing" processes when executing the filter-process logic over the files in the worktree.
+Executing the command with the debug flag from a clone(to avoid interacting with the broken index from the first), it seems to deadlock after executing 10240 completed processes for the filter-process logic over the files in the worktree, which happens to match the annex.queuesize configuration value in use in those repositories.
-The problem seems to be reproducible with any repository with a lot of files in the worktree as far as I can tell, independant of file size.
+The problem seems to be reproducible with any repository with more than the aforementioned count of files in the worktree as far as I can tell, independant of file size.
-The deadlock described makes higher-level commands like git annex sync also block indefinitely when checkout-ing the unlocked branch for any reason.
+The deadlock described makes higher-level commands like git annex sync also block indefinitely when checkout-ing the unlocked branch for any reason in these kinds of unlocked repository du to implcit call to the deadlocking git-annex smudge code.
 Also, because the filtering is not completely applied, the index is pretty scrambled, its easier to clone the repo and move the annex than fix it, for me at least.
-I call the behavior "deadlock" due to the absence of debug log output and low cpu usage on the process when in that state. This seems to indicate some kind of multiprocessing deadlock to me.
+I call the behavior "deadlock" due to the absence of debug log output after the 10240 th process and 0% cpu usage on the remaining git and git-annex processes when the bug happens. This seems to indicate some kind of multiprocessing deadlock to me.
 ### What steps will reproduce the problem?
@@ -27,10 +27,13 @@ Here is a minimum set of bash commands that generate the deadlock on my end:
 git annex add
 git commit -m "add all empty files"
-# This will get stuck after around ~10000-20000 processes from Utility.Process in the debug log while the git annex thaws files into unlocked files
+# This will get stuck after 10240 processes from Utility.Process completed in the debug log while git annex thaws files into unlocked files
-# The deadlock seems to happens after outputing the start of a new thawing, ctrl-c seems to be the only end state for this
+# The deadlock seems to happens after outputing the start of the last thawing in the queue, ctrl-c seems to be the only end state for this
-git annex adjust --unlock --debug
+git annex adjust --unlock --debug 2> ~/unlock-log
+# Ctrl-c the command above once the debug output cease to output new lines without exiting.
+# This commands output the number of processes ran for the command above, which is 10240 for me
+cat ~/unlock-log | grep Perms | wc -l
 ### What version of git-annex are you using? On what operating system?
@@ -64,14 +67,15 @@ Debian Bookworm [Compiled via "building from source on Debian"]
 ### Please provide any additional information below.
-Excerpt of the last lines from the huge debug log:
+Excerpt of the last lines from the huge debug log from the git annex adjust above:
 [2025-01-16 23:30:27.913022014] (Utility.Process) process [493397] done ExitSuccess
 [2025-01-16 23:30:27.91309169] (Annex.Perms) thawing content .git/annex/othertmp/BKQKGR.0/BKQKGR
-Given the huge debug log produced, it may be easier to reproduce the bug to have it than copying it here. If wanted, I can generate one as required.
+Given the huge debug log produced for this bug, it may be easier to reproduce the bug to have it than copying it here. If wanted, I can generate one as required with the process documented in for the bug repoduction above.
-Repeatedly calling this(and ctrl-c it when it inevitably get stuck) seems to eventually unlock the files, but its not really a valid solution in my case.
+Repeatedly calling this(and ctrl-c it when it inevitably get stuck) seems to eventually unlock the files ion batches of 10240, but its not really a valid solution in my case.
 git annex smudge --update --debug
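The reported count matching `annex.queuesize` points at a classic bounded-queue stall. A toy Python sketch (not git-annex code, just an analogy) of how a producer wedges once a bounded queue reaches its configured size and nothing drains it:

```python
# Analogy for the reported deadlock: a queue with a fixed capacity accepts
# exactly `maxsize` items, then the next enqueue cannot proceed. A blocking
# put() at that point would hang forever if the consumer is itself stuck.
import queue

q = queue.Queue(maxsize=4)
for i in range(4):
    q.put_nowait(i)  # fills the queue to its configured size

try:
    q.put_nowait(4)  # one enqueue past capacity
    stuck = False
except queue.Full:
    stuck = True     # with a blocking put(), this would be the hang

print(stuck)
```

This mirrors the observation that exactly 10240 processes complete before the debug log goes silent.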


@@ -0,0 +1,8 @@
[[!comment format=mdwn
username="matrss"
avatar="http://cdn.libravatar.org/avatar/cd1c0b3be1af288012e49197918395f0"
subject="comment 3"
date="2025-01-27T15:08:57Z"
content="""
I can still reproduce this issue with 10.20250115, but in my testing it seems like it only happens against a forgejo-aneksajo instance on localhost without TLS, not against a different remote instance. This setup required `git config annex.security.allowed-ip-addresses 127.0.0.1`, maybe it has something to do with that or TLS...
"""]]


@@ -0,0 +1,8 @@
[[!comment format=mdwn
username="matrss"
avatar="http://cdn.libravatar.org/avatar/cd1c0b3be1af288012e49197918395f0"
subject="comment 4"
date="2025-01-27T15:14:44Z"
content="""
It definitely takes a different code path somehow, as I don't see the `Utility.Url` debug messages when the remote is not on localhost.
"""]]


@@ -0,0 +1,69 @@
### Please describe the problem.
When setting up an (SSH) rsync remote, and _not_ adding the `:` at the end of the hostname, it will create a local folder instead of copying to remote.
```
[joe@laptop]$ git annex initremote myremote type=rsync rsyncurl=ssh.example.com encryption=hybrid keyid=00001111222233334444
[joe@laptop]$ git annex copy . --to myremote
copy metal-arm64.raw (to rpi50...)
ok
copy nixos-gnome-24.11.712512.3f0a8ac25fb6-x86_64-linux.iso (to myremote...)
ok
(recording state in git...)
[joe@laptop]$ ls -l
insgesamt 246792
lrwxrwxrwx. 1 joe joe 204 20. Jan 21:01 metal-arm64.raw -> .git/annex/objects/mG/21/SHA256E-s1306525696--21308f635774faf611ba35c9b04d638aeb7afb1b1c1db949ae65ff81cdafe8b7.raw/SHA256E-s1306525696--21308f635774faf611ba35c9b04d638aeb7afb1b1c1db949ae65ff81cdafe8b7.raw
lrwxrwxrwx. 1 joe joe 204 20. Jan 21:01 nixos-gnome-24.11.712512.3f0a8ac25fb6-x86_64-linux.iso -> .git/annex/objects/fX/g9/SHA256E-s2550136832--da2fe173a279d273bf5a999eafdb618db0642f4a3df95fd94a6585c45082a7f0.iso/SHA256E-s2550136832--da2fe173a279d273bf5a999eafdb618db0642f4a3df95fd94a6585c45082a7f0.iso
drwxr-xr-x. 1 joe joe 12 26. Jan 11:32 ssh.example.com # <---- for me, that was not expected behaviour
```
It might be a feature I don't understand, but because I couldn't find documentation about it, I am leaning towards non-intended behaviour. My assumption would be that an rsync operation to a local directory is already implemented with the [directory special remote](https://git-annex.branchable.com/special_remotes/directory/).
### What steps will reproduce the problem?
Have a remote rsync server, where you don't need to specify the base directory. In my case [this is done with NixOS and this configuration which uses `rrsync`](https://wiki.nixos.org/wiki/Rsync).
The following configures the rsync remote, and later pushes files to it (so far expected behaviour):
```
git annex initremote myremote type=rsync rsyncurl=ssh.example.com: encryption=hybrid keyid=00001111222233334444
git annex copy . --to myremote
```
This however, doesn't copy to the correct remote, but creates a local folder named `ssh.example.com` in my annexed directory instead (note the missing `:` after the hostname):
```
git annex initremote myremote type=rsync rsyncurl=ssh.example.com encryption=hybrid keyid=00001111222233334444
git annex copy . --to myremote # will copy successfully, BUT
ls -l # shows the folder `ssh.example.com` in my directory with the files in it, the rsync remote is empty
```
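For reference, rsync decides on its own whether an operand is remote: roughly, an operand with a `:` before the first `/` (or an `rsync://` prefix) is remote, everything else is a local path. A simplified sketch of that rule (not rsync's actual code, which also handles `host::module` daemon syntax and IPv6 literals):

```python
# Simplified model of how rsync classifies a destination operand.
def is_remote(operand: str) -> bool:
    if operand.startswith("rsync://"):
        return True  # rsync daemon URL
    # a colon appearing before any slash marks a host:path operand
    head = operand.split("/", 1)[0]
    return ":" in head

print(is_remote("ssh.example.com:"))  # with the colon: goes over SSH
print(is_remote("ssh.example.com"))   # without it: a local dir is created
```

This is why the missing trailing `:` silently turns the rsyncurl into a local directory named `ssh.example.com`.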
### What version of git-annex are you using? On what operating system?
* Fedora 41
```
git-annex version: 10.20240701
build flags: Assistant Webapp Pairing Inotify DBus DesktopNotify TorrentParser MagicMime Benchmark Feeds Testsuite S3 WebDAV
dependency versions: aws-0.24.1 bloomfilter-2.0.1.2 crypton-0.34 DAV-1.3.4 feed-1.3.2.1 ghc-9.6.6 http-client-0.7.17 persistent-sqlite-2.13.3.0 torrent-10000.1.3 uuid-1.3.15 yesod-1.6.2.1
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL GITBUNDLE GITMANIFEST VURL X*
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg rclone hook external
operating system: linux x86_64
supported repository versions: 8 9 10
upgrade supported from repository versions: 0 1 2 3 4 5 6 7 8 9 10
local repository version: 10
```
### Please provide any additional information below.
[[!format sh """
# If you can, paste a complete transcript of the problem occurring here.
# If the problem is with the git-annex assistant, paste in .git/annex/daemon.log
# End of transcript or log.
"""]]
### Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders)
I am just now starting to _really_ use git-annex, after following its development and every blogpost you wrote about it for almost a decade now. Thank you for a tool desperately needed!


@@ -0,0 +1,12 @@
[[!comment format=mdwn
username="matrss"
avatar="http://cdn.libravatar.org/avatar/cd1c0b3be1af288012e49197918395f0"
subject="comment 1"
date="2025-01-27T11:28:43Z"
content="""
I'd say this is intended behavior: I assume that the rsyncurl option is more or less passed verbatim to rsync, and rsync can act on both local and remote paths. There is the possibility to use `rsync://` URLs, remote paths via SSH where the host and path are separated by a colon, and local paths.
The rsync special remote with local paths behaves a bit differently than the directory special remote, namely the rsyncurl is remembered (e.g. for autoenable) while the directory special remote does not remember the directory. There can be use-cases for both.
Besides, most of the time I think one would want to specify a remote directory with rsync, in which case the colon is necessary anyway.
"""]]


@@ -0,0 +1,10 @@
[[!comment format=mdwn
username="Atemu"
avatar="http://cdn.libravatar.org/avatar/6ac9c136a74bb8760c66f422d3d6dc32"
subject="comment 1"
date="2025-01-26T02:36:51Z"
content="""
It will not realise this.
Why do you have separate repos for this though? You can absolutely just use a non-plain git repo for synchronisation purposes too.
"""]]


@@ -0,0 +1,8 @@
[[!comment format=mdwn
username="jnkl"
avatar="http://cdn.libravatar.org/avatar/2ab576f3bf2e0d96b1ee935bb7f33dbe"
subject="comment 2"
date="2025-01-26T13:09:04Z"
content="""
Sorry, I am new to git. I thought pushes are only allowed to bare repositories. Am I wrong?
"""]]


@@ -0,0 +1,8 @@
[[!comment format=mdwn
username="Atemu"
avatar="http://cdn.libravatar.org/avatar/6ac9c136a74bb8760c66f422d3d6dc32"
subject="comment 3"
date="2025-01-26T13:30:10Z"
content="""
git-annex synchronises branch state via the `synced/branchnamehere` branches. The actual checked out branch in the worktree will only be updated when you run a `merge` or `sync` in the worktree.
"""]]


@@ -0,0 +1,26 @@
[[!comment format=mdwn
username="Atemu"
avatar="http://cdn.libravatar.org/avatar/6ac9c136a74bb8760c66f422d3d6dc32"
subject="comment 2"
date="2025-01-26T02:54:18Z"
content="""
My issue apparently had to do with numcopies? I first passed `--numcopies 2` because I was curious but it didn't change anything. Then I passed `--numcopies 1` and it immediately dropped all the files as I'd have expected it to at `numcopies=3`. Running another sync without `--numcopies` didn't attempt to pull in the dropped files either.
This smells like a bug? If numcopies was actually violated, it should attempt to correct that again, right? (All files were available from a connected repo.)
Here are the numcopies stats from `git annex info .`:
```
numcopies stats:
numcopies +1: 1213
numcopies +0: 25310
```
Some more background: I have a bunch of drives that are offline that I have set to be trusted. One repo on my NAS is online at all times and semitrusted.
I have two offline groups: `cold` and `lukewarm`. All drives in those groups are trusted.
It's weird that it didn't work with 2 but did work with 1. This leads me to believe it could have been due to the one repo being online while the others are offline and trusted; acting more like mincopies. Was behaviour changed in this regard recently?
I'd still like to know how to debug wanted expressions too though.
"""]]


@@ -0,0 +1,44 @@
[[!comment format=mdwn
username="beryllium@5bc3c32eb8156390f96e363e4ba38976567425ec"
nickname="beryllium"
avatar="http://cdn.libravatar.org/avatar/62b67d68e918b381e7e9dd6a96c16137"
subject="Simple config amendment for Apache served repositories"
date="2025-01-28T08:34:40Z"
content="""
If you follow the [git-http-backend][id] documentation for serving repositories via Apache, you'll read this section:
<blockquote>
<p>To serve gitweb at the same url, use a ScriptAliasMatch to only
those URLs that <em>git http-backend</em> can handle, and forward the
rest to gitweb:</p>
</blockquote>
<pre>
ScriptAliasMatch \
	\"(?x)^/git/(.*/(HEAD | \
			info/refs | \
			objects/(info/[^/]+ | \
				[0-9a-f]{2}/[0-9a-f]{38} | \
				pack/pack-[0-9a-f]{40}\.(pack|idx)) | \
			git-(upload|receive)-pack))$\" \
	/usr/libexec/git-core/git-http-backend/$1
ScriptAlias /git/ /var/www/cgi-bin/gitweb.cgi/
</pre>
If you add the following AliasMatch between the two ScriptAlias directives, you can get Apache to serve the (...).git/config file to the http client, in this case git-annex.
<pre>
AliasMatch \"(?x)^/git/(.*/config)$\" /var/www/git/$1
</pre>
This allows the annexes to use autoenable=true to pin the centralisation afforded by the git-only repository. Keep a \"source of truth\" so to speak (acknowledging that this is antithetical to what git-annex aims to do).
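A quick sanity check of the AliasMatch pattern (the repository name here is hypothetical):

```python
# Exercise the AliasMatch regex: it should serve per-repository config
# files and leave everything else to git-http-backend.
import re

pat = re.compile(r'(?x)^/git/ (.*/config) $')

m = pat.match('/git/myrepo.git/config')
print(m.group(1))                         # the path Apache aliases to
print(pat.match('/git/myrepo.git/HEAD'))  # no match: handled elsewhere
```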
As an aside, the tip to generate a uuid didn't seem to work for me. But I suspect I missed the point somewhat.
Regardless, if you are able to alter the configuration of your \"centralised\" git repository, this might be of assistance.
[id]: https://git-scm.com/docs/git-http-backend \"git-http-backend\"
"""]]


@@ -0,0 +1,70 @@
[[!comment format=mdwn
username="joey"
subject="""comment 10"""
date="2025-01-28T14:06:41Z"
content="""
Using metadata to store the inputs of computations like I did in my example
above seems that it would allow the metadata to be changed later, which
would change the output when a key gets recomputed. That feels surprising,
because metadata could be changed for any reason, without the intention
of affecting a compute special remote.
It might be possible for git-annex to pin down the current state of
metadata (or the whole git-annex branch) and provide the same input to the
computation when it's run again. (Unless `git-annex forget` has caused
that old branch state to be lost..) But it can't fully isolate the program
from all unpinned inputs without using some form of containerization,
which feels out of scope for git-annex.
Instead of using metadata, the input values could be stored in the
per-special-remote state of the generated key. Or the input values could be
encoded in the key itself, but then two computations that generate the same
output would have two different keys, rather than hashing to the same key.
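A toy sketch of that trade-off (the key formats below are made up for illustration): two computations with different recorded inputs but identical output share one content-addressed key, yet would get distinct keys if the inputs were encoded into the key itself.

```python
# Hypothetical key derivations, not git-annex's real backends.
import hashlib

def content_key(data):
    return 'SHA256--' + hashlib.sha256(data).hexdigest()

def input_key(inputs):
    blob = repr(sorted(inputs.items())).encode()
    return 'COMPUTE--' + hashlib.sha256(blob).hexdigest()

output = b'identical output bytes'
inputs_a = {'starttime': '15:00', 'endtime': '30:00'}
inputs_b = {'starttime': '15:00.0', 'endtime': '30:00.0'}  # same cut, spelled differently

print(content_key(output) == content_key(output))  # one key for one content
print(input_key(inputs_a) == input_key(inputs_b))  # distinct keys per input set
```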
Using a key with a regular hash backend also lets the user find out if the
computation turns out to not be reproducible later for whatever reason;
getting the file from the compute special remote will fail at hash
verification time. Something like a VURL key could still alternatively be
used in cases where reproducibility is not important.
To add a computed file, the interface would look close to the same,
but now the --value options are setting fields in the compute special
remote's state:
	git-annex addcomputed foo --to ffmpeg-cut \
		--input source=input.mov \
		--value starttime=15:00 \
		--value endtime=30:00
The values could be provided to the "git-annex-compute-" program with
environment variables.
For `--input source=foo`, it could look up the git-annex key (or git sha1)
of that file, and store that in the state. So it would provide the compute
program with the same data every time. But it could *also* store the
filename. And that allows for a command like this:
	git-annex recompute foo --from ffmpeg-cut
Which, when the input.mov file has been changed, would re-run the
computation with the new content of the file, and stage a new version of
the computed file. It could even be used to recompute every file in a tree:
	git-annex recompute . --from ffmpeg-cut
Also, that command could let input values be adjusted later:
	git-annex recompute foo --from ffmpeg-cut --value starttime=14:50
	git commit -m 'include the introduction of the speaker in the clip'
It would also be good to have a command that examines a computed key
and displays the values and inputs. That could be `git-annex whereis`
or perhaps a dedicated command with more structured output:
	git-annex examinecompute foo --from ffmpeg-cut
	source=input.mov (annex key SHA256--xxxxxxxxx)
	starttime=15:00
	endtime=30:00
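A minimal sketch of the state this implies (all field names are hypothetical): each input is pinned to the annex key it had when the file was added, which is what lets a recompute notice that a source file has changed.

```python
# Hypothetical per-special-remote state for one computed key.
state = {
    'inputs': {'source': {'file': 'input.mov', 'key': 'SHA256--aaa'}},
    'values': {'starttime': '15:00', 'endtime': '30:00'},
}

def needs_recompute(state, current_keys):
    # current_keys maps worktree files to their current annex keys;
    # any pinned input whose key moved means the output is stale.
    return any(current_keys.get(inp['file']) != inp['key']
               for inp in state['inputs'].values())

print(needs_recompute(state, {'input.mov': 'SHA256--aaa'}))  # unchanged source
print(needs_recompute(state, {'input.mov': 'SHA256--bbb'}))  # source replaced
```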
This all feels like it might allow for some useful workflows...
"""]]


@@ -0,0 +1,24 @@
[[!comment format=mdwn
username="joey"
subject="""Re: worktree provisioning"""
date="2025-01-28T14:08:29Z"
content="""
@m.risse in your example the "data.nc" file gets new content when
retrieved from the special remote after the source file has changed.
But if you already have the data.nc file present in a repository, it
does not get updated immediately when you update the source
"data.grib" file.
So, a drop and re-get of a file changes the version of the file you have
available. For that matter, if the old version has been stored on other
remotes, a get may retrieve either an old or a new version.
That is not intuitive and it makes me wonder if using a
special remote is really a good fit for what you're wanting to do.
In your "cdo" example, it's not clear to me if the new version of the
software generates an identical file to the old, or if it has a bug fix
that causes it to generate a significantly different output. If the two
outputs are significantly different then treating them as the same
git-annex key seems questionable to me.
"""]]


@@ -0,0 +1,29 @@
[[!comment format=mdwn
username="joey"
subject="""comment 12"""
date="2025-01-28T15:39:44Z"
content="""
My design so far does not fully support
"Request one key, receive many".
My `git-annex addcomputed` command doesn't handle the case where a
computation generates multiple output files. While the `git-annex-compute-`
command's interface could let it return several computed files, addcomputed
would only adds one file to the name that the user specifies. What is it
supposed to do if the computation generates more than one? Maybe it needs a
way to let a whole directory be populated with the files generated by a
computation. Or a way to specify multiple files to add.
And here's another problem:
Suppose I have one very expensive computation that generates files foo
and bar. And a second, less expensive computation, that also generates foo
(same content) as well as generating baz. Both computations are run on the
same compute special remote. Now if the user runs `git-annex get foo`,
they will be unhappy if it chooses to run the expensive computation,
rather than the less expensive computation.
Since the per-special remote state for a key is used as the computation
input, only one input can be saved for foo's key. So it wouldn't really be
picking between two alternatives, it would just use whatever the current
state for that key is.
"""]]


@@ -3,5 +3,5 @@
 subject="""comment 3"""
 date="2024-04-30T19:31:35Z"
 content="""
-See also [[todo/wishlist__58___derived_content_support]].
+See also [[todo/wishlist:_derived_content_support]].
 """]]


@@ -3,11 +3,30 @@
 subject="""comment 6"""
 date="2024-04-30T19:53:43Z"
 content="""
-On trust, it seems to me that if someone chooses to enable a particular
-special remote, they are choosing to trust whatever kind of computations it
-supports.
+On trust, it seems to me that if someone chooses to install a
+particular special remote, they are choosing to trust whatever kind of
+computations it supports.
 Eg a special remote could choose to always run a computation inside a
 particular container system and then if you trust that container system is
-secure, you can choose to use it.
+secure, you can choose to install it.
+Enabling the special remote is not necessary, because a
+repository can be set to autoenable a special remote. In some sense this is
+surprising. I had originally talked about enabling here and then I
+remembered autoenable.
+It may be that autoenable should only be allowed for
+special remote programs that the user explicitly whitelists, not only
+installs into PATH. That would break some existing workflows, though
+setting some git configs would not be too hard.
+There seems scope for both compute special remotes that execute code that
+comes from the git repository, and ones that only have metadata about the
+computation recorded in the git repository, in a way that cannot let them
+execute arbitrary code under the control of the git repository.
+A well-behaved compute special remote that does run code that comes from a
+git repository could require an additional git config to be set to allow it
+to do that.
 """]]


@@ -0,0 +1,75 @@
[[!comment format=mdwn
username="joey"
subject="""comment 9"""
date="2025-01-27T14:46:43Z"
content="""
Circling back to this, I think the fork in the road is whether this is
about git-annex providing this and that feature to support external special
remotes that compute, or whether git-annex gets a compute special
remote of its own with some simpler/better extension interface
than the external special remote protocol.
Of course, git-annex having its own compute special remote would not
preclude other external special remotes that compute. And for that matter,
a single external special remote could implement an extension interface.
---
Thinking about how a generic compute special remote in git-annex could
work, multiple instances of it could be initremoted:
	git-annex initremote convertfiles type=compute program=csv-to-xslx
	git-annex initremote cutvideo type=compute program=ffmpeg-cut
Here the "program" parameter would cause a program like
`git-annex-compute-ffmpeg-cut` to be run to get files from that instance
of the compute special remote. The interface could be as simple as it
being run with the key that it is requested to compute, and outputting
the paths to all the keys it was able to compute. (So allowing for
"request one key, receive many".) Perhaps also with some way to indicate
progress of the computation.
It would make sense to store the details of computations in git-annex
metadata. And a compute program can use git-annex commands to get files
it depends on. Eg, `git-annex-compute-ffmpeg-cut` could run:
	# look up the configured metadata
	starttime=$(git-annex metadata --get compute-ffmpeg-starttime --key=$requested)
	endtime=$(git-annex metadata --get compute-ffmpeg-endtime --key=$requested)
	source=$(git-annex metadata --get compute-ffmpeg-source --key=$requested)
	# get the source video file
	git-annex get --key=$source
	git-annex examinekey --format='${objectpath}' $source
It might be worth formalizing that a given computed key can depend on other
keys, and have git-annex always get/compute those keys first. And provide
them to the program in a worktree?
When asked to store a key in the compute special remote, it would verify
that the key can be generated by it. Using the same interface as used to
get a key.
This all leaves a chicken and egg problem, how does the user add a computed
file if they don't know the key yet?
The user could manually run the commands that generate the computed file,
then `git-annex add` it, and set the metadata. Then `git-annex copy --to`
the compute remote would verify if the file can be generated, and add it if
so. This seems awkward, but also nice to be able to do manually.
Or, something like VURL keys could be used, with an interface something
like this:
	git-annex addcomputed foo --to ffmpeg-cut
		--input compute-ffmpeg-source=input.mov
		--set compute-ffmpeg-starttime=15:00
		--set compute-ffmpeg-endtime=30:00
All that would do is generate some arbitrary VURL key or similar,
provisionally set the provided metadata (how?), and try to store the key
in the compute special remote. If it succeeds, stage an annex pointer
and commit the metadata. Since it's a VURL key, storing the key in the
compute special remote would also record the hash of the generated file
at that point.
"""]]


@@ -0,0 +1,14 @@
[[!comment format=mdwn
username="matrss"
avatar="http://cdn.libravatar.org/avatar/cd1c0b3be1af288012e49197918395f0"
subject="comment 6"
date="2025-01-27T15:26:15Z"
content="""
> > If the PSK were fully contained in the remote string then a third-party getting hold of that string could pretend to be the server
> I agree this would be a problem, but how would a third-party get ahold of the string though? Remote urls don't usually get stored in the git repository, perhaps you were thinking of some other way.
My thinking was that git remote URLs usually aren't sensitive information that inherently grant access to a repository, so a construct where the remote URL contains the credentials is just unexpected. A careless user might e.g. put it into a `type=git` special remote or treat it in some other way in which one wouldn't treat a password, without considering the implications. I am not aware of a way in which they could be leaked without user intervention, though.
Having separate credentials explicitly named as such just seems safer. But in the end this would be the responsibility of the one implementing the p2p transport, anyway.
"""]]