Joey Hess 2025-02-19 15:14:52 -04:00
parent ace9944d1c
commit 2f11c65491
2 changed files with 78 additions and 0 deletions


@ -0,0 +1,13 @@
[[!comment format=mdwn
username="joey"
subject="""comment 18"""
date="2025-02-19T18:29:58Z"
content="""
I've started a `compute` branch which so far has documentation for
the [compute special remote](http://source.git-annex.branchable.com/?p=source.git;a=blob;f=doc/special_remotes/compute.mdwn;hb=refs/heads/compute),
[git-annex addcomputed](http://source.git-annex.branchable.com/?p=source.git;a=blob;f=doc/git-annex-addcomputed.mdwn;hb=refs/heads/compute),
and
[git-annex recompute](http://source.git-annex.branchable.com/?p=source.git;a=blob;f=doc/git-annex-recompute.mdwn;hb=refs/heads/compute).
I am pretty happy with how this design is shaping up.
"""]]


@ -0,0 +1,65 @@
[[!comment format=mdwn
username="joey"
subject="""open questions"""
date="2025-02-19T18:39:41Z"
content="""
One thing that I am unsure about is what should happen if `git-annex get foo`
needs the content of file `bar`, which is not present. Should it get `bar` from
a remote? Or should it fail to get `foo`?
Consider that, in the case of `git-annex get foo --from computeremote`, the
user has asked it to get a file from that particular remote, not from
whatever remote contains `bar`.
If the same compute remote can also compute `bar`, it seems quite reasonable
for `git-annex get foo --from computeremote` to also compute `bar`. (This is
similar to a single computation that generates two output files, in which
case getting one of them will get both of them.)
And it seems reasonable for `git-annex get foo` with no specified remote
to also get or compute `bar`, from wherever.
But, there is no way at the level of a special remote to tell the
difference between those two commands.
Maybe the right answer is to define getting a file from a compute
special remote as including getting its inputs from other remotes,
preferring the same compute special remote when possible,
and when not, using the lowest cost remote that works, same as
`git-annex get` does.
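
The selection rule proposed above could be sketched roughly like this. This is Python for illustration, not git-annex code; `Remote`, `provides`, and `pick_input_remote` are hypothetical names, and the costs are made up.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Remote:
    name: str
    cost: int            # lower is preferred (made-up values)
    provides: frozenset  # files this remote can supply (hypothetical model)

def pick_input_remote(input_file, requested, remotes):
    # Prefer the compute remote the user asked for, when it can also
    # produce the input file.
    if requested is not None and input_file in requested.provides:
        return requested
    # Otherwise use the lowest cost remote that has the input.
    candidates = [r for r in remotes if input_file in r.provides]
    return min(candidates, key=lambda r: r.cost, default=None)
```

With `requested` set to `None`, this degenerates to plain lowest-cost selection, which is the `git-annex get foo` case with no remote specified.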
----
A related problem is that `foo` might be fairly small, but `bar` very
large. So getting a small object can require getting or generating other
large objects. Getting `bar` might fail because there is not enough space
to meet annex.diskreserve. Or the user might just be surprised that so much
disk space was eaten up. But dropping `bar` after computing `foo` also
doesn't seem like a good idea; the user might want to hang onto their copy
now that they have it, or perhaps move it to some faster remote.
Maybe preferred content is the solution? After computing `foo` with `bar`,
keep the copy of `bar` if the local repository wants it, drop it otherwise.
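
A rough sketch of those two checks, in Python for illustration only. `can_fetch_input` and `after_compute` are hypothetical names, and `wanted_locally` is a stand-in predicate for the repository's preferred content expression, not a real git-annex interface.

```python
def can_fetch_input(free_bytes, input_size, diskreserve):
    # Fetching the input must still leave annex.diskreserve bytes free.
    return free_bytes - input_size >= diskreserve

def after_compute(input_file, wanted_locally):
    # Keep the input only if the local repository wants it.
    return "keep" if wanted_locally(input_file) else "drop"
```

For example, `after_compute("bar", lambda f: False)` yields `"drop"`, which matches a repository whose preferred content does not include `bar`.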
----
Progress display is also going to be complicated for this. There is no
way in the special remote interface to display the progress for `bar`
while getting `foo`.
Probably the thing to do would be to add together the sizes of both files,
and display a combined progress meter.
It would be ok to not say when it's getting the input file.
This will need a way to set the size for a progress display to larger
than the size of the key.
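
The combined meter could work along these lines. This is an illustrative Python sketch, not the special remote interface; `combined_progress` is a hypothetical helper.

```python
def combined_progress(sizes, done):
    # sizes: expected bytes per file, for the requested file and its inputs
    # done: bytes transferred or generated so far, per file
    total = sum(sizes.values())
    # Clamp each file's progress to its expected size.
    transferred = sum(min(done.get(name, 0), size)
                      for name, size in sizes.items())
    return transferred, total
```

Here the total is the sum of both sizes, so the meter for `foo` has to be allowed a size larger than the key for `foo` alone.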
----
All 3 problems above go away if it doesn't automatically get input files
before computations and the computations instead just fail with an error
saying the input file is not present.
But then consider the case where you just want every file in the repository.
`git-annex get .` failing to compute some files because their input files
happen to come after them in the directory listing is not good.
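
The ordering problem can be shown with a small simulation. This is Python for illustration, assuming a simplified model where files are processed strictly in listing order and a computation fails if any input is absent; `get_in_listing_order` and the file names are hypothetical.

```python
def get_in_listing_order(files, inputs, present):
    # files: names in directory listing order
    # inputs: maps a computed file to the input files it needs
    # present: files whose content is already in the repository
    # Returns the files that failed because an input was missing
    # at the moment their computation ran.
    present = set(present)
    failed = []
    for name in files:
        if name in present:
            continue
        needed = inputs.get(name, [])
        if all(i in present for i in needed):
            present.add(name)
        else:
            failed.append(name)
    return failed
```

With `files = ["a.out", "b.in"]` and `a.out` computed from `b.in`, `a.out` fails even though `b.in` is retrieved a moment later; listing the same files in the opposite order succeeds.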
"""]]