This commit is contained in:
Joey Hess 2025-05-21 14:14:37 -04:00
parent 1c270d3251
commit 0550974185
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
2 changed files with 47 additions and 0 deletions

View file

@ -0,0 +1,30 @@
[[!comment format=mdwn
username="joey"
subject="""comment 3"""
date="2025-05-21T17:06:51Z"
content="""
As to the ordering, at first I thought it would make sense for it to
pick the most full repository that still has space for a file.
But: Suppose that the files being processed alternate between large, and
small. The fullest tape is too full for any of the large files, but it can
hold all the small files. The second fullest tape has plenty of room.
In this case, it would constantly switch back and forth between the two tapes.
sizebalanced picks the least full repository. That's not what we want
either, clearly, since it alternates between repositories frequently when
they're near the same size.
The optimal solution is for git-annex to remember what repository was used
to store the last file, and can just use that repository again. Unless it's
full, in which case it can pick any repository that still has space. And
then it will continue to use that new repository for subsequent files.
That memory would necessarily be local to a repository in front of these
tape remotes. (Eg, a cluster gateway). If there were multiple repositories
that were all writing to the same tape remotes, they would each have their
own memory, and chaos would ensue.
Needing a memory makes me a bit dubious about putting this in a preferred
content expression. But in your specific case, I guess it would work.
"""]]

View file

@ -0,0 +1,17 @@
[[!comment format=mdwn
username="joey"
subject="""comment 4"""
date="2025-05-21T18:05:27Z"
content="""
Another approach would be to configure `remote.<name>.annex-cost-command`
with a command that gives a low cost to the tape in the drive, and a high
cost to other tapes.
But git-annex only checks the cost once at startup. It would need to check
it again after each file. Which could be a new configuration setting. You
would need to make the cost command efficent enough that running it once per
file is not too slow.
With this approach, the standard archive group preferred content
would probably suffice.
"""]]