Merge branch 'master' into ospath

Joey Hess 2025-02-14 16:28:43 -04:00
commit e8b00faea8
GPG key ID: DB12DB0FF05F8F38
8 changed files with 181 additions and 1 deletions

@@ -88,7 +88,8 @@ installLibs appbase installedbins replacement_libs libmap = do
  -}
 otool :: FilePath -> M.Map FilePath FilePath -> [(FilePath, FilePath)] -> LibMap -> IO ([FilePath], [(FilePath, FilePath)], LibMap)
 otool appbase installedbins replacement_libs libmap = do
-    files <- filterM doesFileExist =<< dirContentsRecursive appbase
+    files <- filterM doesFileExist
+        =<< (map fromRawFilePath <$> dirContentsRecursive (toRawFilePath appbase))
     process [] files replacement_libs libmap
   where
     want s =

@@ -0,0 +1,80 @@
**draft**
The [[special_remotes/compute]] special remote uses this interface to run
compute programs.
When a compute special remote is initremoted, a program is specified:
    git-annex initremote myremote type=compute program=foo
That causes `git-annex-compute-foo` to be run to get files from that
compute special remote.
The environment variable `ANNEX_COMPUTE_KEY` is the key that the program
is requested to compute.
The program is run in a temporary directory, which will be cleaned up after it
exits. When it generates the content of a key, it should write it to a file
with the same name as the key, in that directory. Then it should
output the key in a line to stdout.
While usually this will be the requested key, the program can output any
number of other keys as well, all of which will be stored in the git-annex
repository when getting files from the compute special remote. When a
computation generates several files, this allows running it a single time
to get them all.
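The output convention described above can be sketched in a few lines of shell. This is only an illustration: `emit_key` is a hypothetical helper, not something git-annex provides, and the key names in the demo are placeholders rather than real keys.

```shell
#!/bin/sh
# Sketch only: emit_key is a hypothetical helper, not part of git-annex.
# It stores stdin in a file named after the key in the current directory
# (the program is run in the temporary directory), and only then
# announces the key on a line to stdout.
emit_key() {
    cat > "./$1"
    printf '%s\n' "$1"
}

# Demo in a scratch directory; the key names are placeholders.
(
    cd "$(mktemp -d)" || exit 1
    printf 'main output\n' | emit_key "PLACEHOLDER--requested"
    printf 'side output\n' | emit_key "PLACEHOLDER--extra"
)
```

Writing the file before printing the key means the content is already in place by the time the key appears on stdout.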
The program is passed environment variables to provide inputs to the
computation. These are all prefixed with `"ANNEX_COMPUTE_"`.
The names are taken from the `git-annex addcomputed` command that was used to
add a computed file to the repository.
For example, this command:
    git-annex addcomputed file.gen --to foo \
        --input raw=file.raw --value passes=10
Will result in this environment:
    ANNEX_COMPUTE_KEY=SHA256--...
    ANNEX_COMPUTE_raw=file.raw
    ANNEX_COMPUTE_INPUT_raw=/path/.git/annex/objects/..
    ANNEX_COMPUTE_passes=10
For security, the program should avoid exposing values from `ANNEX_COMPUTE_*`
variables to the shell unprotected, or otherwise executing them.
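One way a shell program might follow this advice is to validate a value before using it and to keep every expansion quoted. The sketch below assumes that `passes` is meant to be a plain decimal number; `is_uint` is a hypothetical helper, not part of git-annex.

```shell
#!/bin/sh
# Sketch only: is_uint is a hypothetical helper, not part of git-annex.
# It succeeds only when its argument is a non-empty string of decimal digits.
is_uint() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Validate before use, and always quote the expansion. Never pass the value
# to eval or splice it into a command string such as sh -c "... $value ...".
passes="${ANNEX_COMPUTE_passes:-}"
if is_uint "$passes"; then
    echo "passes looks like a number: $passes" >&2
else
    echo "refusing unexpected value for passes" >&2
fi
```

A real program would probably exit nonzero in the second branch, so that nothing it computed gets stored.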
The program will also inherit other environment variables
that were set when git-annex was run, like PATH.
Anything that the program outputs to stderr will be displayed to the user.
This stderr should be used for error messages, and possibly computation
output, but not for progress displays, since git-annex has its own progress
displays.
If possible, the program should write the content of the key it is
generating directly to the file, rather than writing to somewhere else and
renaming it at the end. If git-annex sees that the file for the key it
requested is growing, it will use the file size when displaying progress
to the user.
Alternatively, if the program outputs a number on a line to stdout, this is
taken to be the number of bytes of the requested key that have been computed
so far. Or, the program can output a percentage, e.g. "50%", on a line to stdout
to indicate what percent of the computation has been performed so far.
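As a rough illustration of the percentage form, a program whose work splits into equal steps could report progress after each one. `report_percent` is a hypothetical helper, and the four-step loop stands in for a real computation.

```shell
#!/bin/sh
# Sketch only: report_percent is a hypothetical helper, not part of git-annex.
# Given the number of steps done and the total number of steps, it prints an
# integer percentage such as "50%" on its own line to stdout.
report_percent() {
    printf '%d%%\n' $(( $1 * 100 / $2 ))
}

# Example: a computation made of 4 equal steps.
step=1
while [ "$step" -le 4 ]; do
    # ... one step of the real computation would run here ...
    report_percent "$step" 4
    step=$(( step + 1 ))
done
```

The byte-count form would work the same way, printing a bare number instead of a percentage.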
If the program exits nonzero, nothing it computed will be stored in the
git-annex repository.
An example `git-annex-compute-foo` shell script follows:
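    #!/bin/sh
    set -e
    # `||` cannot appear inside `[ ]`, so each test gets its own brackets.
    if [ -z "$ANNEX_COMPUTE_passes" ] || [ -z "$ANNEX_COMPUTE_INPUT_raw" ]; then
        echo "Missing expected inputs" >&2
        exit 1
    fi
    # Write the result directly to a file named after the requested key.
    frobnicate --passes="$ANNEX_COMPUTE_passes" \
        <"$ANNEX_COMPUTE_INPUT_raw" >"$ANNEX_COMPUTE_KEY"
    # Tell git-annex which key was computed.
    echo "$ANNEX_COMPUTE_KEY"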

@@ -0,0 +1,8 @@
[[!comment format=mdwn
username="anarcat"
avatar="http://cdn.libravatar.org/avatar/4ad594c1e13211c1ad9edb81ce5110b7"
subject="similar topic"
date="2025-02-14T17:51:29Z"
content="""
see also [[moving_annex_across_filesystems]]
"""]]

@@ -0,0 +1,8 @@
[[!comment format=mdwn
username="anarcat"
avatar="http://cdn.libravatar.org/avatar/4ad594c1e13211c1ad9edb81ce5110b7"
subject="similar topic"
date="2025-02-14T17:47:01Z"
content="""
see also [[forum/moving_annex_across_filesystems]]
"""]]

@@ -24,3 +24,20 @@ I believe that's because rm won't fail to remove files if they are readonly when
Anyways - what's the proper way of doing this? I know I could `git clone` the repository and `git annex get` everything, but that would create another repository with a new UUID. That's duplication I do not want.
Thanks for the advice! -- [[anarcat]]
Update, years later... The problem with cloning is that it pollutes the history of the git repository, with all that location information duplicated for a repo that is effectively, immediately forgotten.
That said, it's quite nice to use git itself to move the repository, as it provides a more reliable way to do this:
cd /srv
git clone ~/Photos
cd Photos
git annex get
As long as you don't `git-annex-sync`, you don't send the UUID back. I guess it's *possible* to `git-annex-reinit` to recycle the UUID, but I'm not sure it helps with the extra metadata created.
In some cases, however, this is actually what you want: you *are* creating a new repository, even if you're removing the old one. I've found that the actual, safest way to do those transfers is to clone, as sometimes `mv(1)` can fail halfway and then you have an inconsistent copy and you need to restart from scratch.
Furthermore, while it has been stated elsewhere ([[forum/best_way_to_move_a_git_annex_repo_trought_file_system]], [[forum/Relocating_annex_directory]]) that a git-annex "repository is just a collection of files in a directory", I would argue it's not *quite* true. A git-annex repository is quite peculiar: it has hidden files, readonly files and directories, and can have symbolic links. And while those might seem perfectly normal to a seasoned UNIX programmer or system administrator, they trigger a bunch of special edge cases that might confuse a lot of people (like broken links, permission denied errors when removing folders, etc).
The idea that git-annex is "just a normal folder" is nice in theory, but it breaks down in some edge cases, and I think it's important for people to be aware of that, especially when doing special operations like this.

@@ -0,0 +1,24 @@
[[!comment format=mdwn
username="joey"
subject="""Re: comment 13"""
date="2025-02-13T16:36:45Z"
content="""
@m.risse earlier you said that it would be bad to
> Silently use the old version of "data.grib", creating a mismatch between
> "data.nc" and "data.grib"
That's what I was getting at when I said:
> But if you already have data.nc file present in a repository, it does not
> get updated immediately when you update the source "data.grib" file.
So just using files from HEAD for the computation is not sufficient to
avoid this kind of mismatch. The user will need some workflow to deal with
it.
Eg, they could recompute data.nc whenever data.grib is updated, and so make a
commit that updates both files together. But if they're doing that, why does
the computation need to use files from HEAD? Recomputing data.nc could just as
well pin the new key of data.grib.
"""]]

@@ -0,0 +1,34 @@
[[!comment format=mdwn
username="joey"
subject="""Re: crossing repository boundaries"""
date="2025-02-13T17:01:52Z"
content="""
It could be argued that git-annex should recurse into submodules.
Oddly, I don't remember that anyone has ever tried to make that argument.
If they did it was a long time ago. It may be that datalad has relieved
enough of the pressure in that area that it's not bothering many people.
Anyway, I wouldn't want to tie compute special remotes to changing
git-annex in that way, but I also wouldn't want to rule out adding
useful stuff to git-annex just because it breaches the submodule boundary
in a way that's new to git-annex.
Thinking about a command like this:
git-annex addcomputed foo --to ffmpeg-cut \
--input source=submodule/input.mov \
--value starttime=15:00 \
--value endtime=30:00
That would need to look inside the submodule to find the input key.
When getting the key later, it can't rely on the tree still containing the
same submodules at the same locations. `git mv submodule foo` would break
the computation.
I think that can be dealt with by having it fall back to checking location
logs of all submodules, to find the submodule that knows about a key.
Deleting a submodule would still break the computation, and that seems
difficult to avoid. Seems acceptable.
"""]]

@@ -0,0 +1,8 @@
[[!comment format=mdwn
username="joey"
subject="""comment 17"""
date="2025-02-13T20:10:52Z"
content="""
I've written up a draft interface for programs used by a compute special
remote: [[design/compute_special_remote_interface]]
"""]]