balanced preferred content and --rebalance

This all works fine. But it doesn't check repository sizes yet, and
without repository size checking, once a repository gets full, there
will be no other repository that will want its files.

Use of sha2 seems unnecessary, probably adler32 or md5 or crc would have
been enough. Possibly just summing up the bytes of the key mod the number
of repositories would have sufficed. But sha2 is there, and probably
hardware accelerated. I doubt very much there is any security benefit
to using it though. If someone wants to construct a key that will be
balanced onto a given repository, sha2 is certainly not going to stop
them.
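
To illustrate the point about hashing, here is a minimal Python sketch (not git-annex's actual Haskell code). It assumes a key is mapped to a repository by reducing a hash of the key modulo the number of repositories. The function names and the `example-key-` prefix are hypothetical. The last function shows why sha2 adds no security here: a key landing on a chosen repository is found after a handful of tries.

```python
import hashlib

def repo_index_sha256(key: str, num_repos: int) -> int:
    """Map a key to a repository index: SHA-256 of the key, reduced mod num_repos."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_repos

def repo_index_bytesum(key: str, num_repos: int) -> int:
    """The cheaper alternative: sum of the key's bytes, reduced mod num_repos."""
    return sum(key.encode("utf-8")) % num_repos

def find_key_for_repo(target: int, num_repos: int, prefix: str = "example-key-") -> str:
    """Brute-force a (hypothetical) key name that lands on the target index.

    With num_repos repositories this needs about num_repos attempts on
    average, regardless of how strong the hash is.
    """
    counter = 0
    while True:
        candidate = f"{prefix}{counter}"
        if repo_index_sha256(candidate, num_repos) == target:
            return candidate
        counter += 1
```

Real keys are derived from file content, so an attacker would vary the content rather than a name suffix, but the cost argument is the same.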
Joey Hess 2024-08-09 14:16:09 -04:00
parent 152c87140b
commit 3ce2e95a5f
GPG key ID: DB12DB0FF05F8F38
11 changed files with 169 additions and 17 deletions


@ -13,7 +13,7 @@ that entirely:
Existing preferred content expressions such as the one for archive group
have this problem.
So, let's add a new expression: `balanced(group)`
So, let's add a new expression: `balanced=group`
## implementation
@ -47,7 +47,7 @@ repo1 and repo2 want to swap some files between them,
So, we'll want to add some precautions to avoid a lot of data moving around
in such a case:
((balanced(backup) and not (copies=backup:1)) or present
(balanced=backup and not (copies=backup:1)) or present
So once a file lands on a backup drive, it stays there, even if more backup
drives change the balancing.
@ -56,7 +56,7 @@ drives change the balancing.
What if we have 5 backup repos and want each key to be stored in 3 of them?
There's a simple change that can support that:
`balanced(group:3)`
`balanced=group:3`
This works the same as before, but rather than just `N mod M`, take
`N+I mod M` where I is [0..2] to get the list of 3 repositories that want a
@ -78,10 +78,10 @@ number of drives. Each file should have 1 copy stored in each datacenter,
on some drive there.
This can be implemented by making a group for each datacenter, which all of
its drives are in, and using `balanced()` to pick the drive that holds the
its drives are in, and using `balanced` to pick the drive that holds the
copy of the file. The preferred content expression would be eg:
((balanced(datacenterA) and not (copies=datacenterA:1)) or present
(balanced=datacenterA and not copies=datacenterA:1) or present
In such a situation, to avoid a `N^2` remote interconnect, there might be a
transfer repository in each datacenter, that is in front of its drives. The
@ -90,7 +90,7 @@ destination drive. How to write a preferred content expression for that?
It might be sufficient to use `copies=datacenterA:1`, so long as the file
reaching any drive in the datacenter is enough. But may want to add
something analogous to `inallgroup=` that checks if a file is in
the place that `balanced()` picks for a group. Eg,
the place that `balanced` picks for a group. Eg,
`balancedgroup=datacenterA` for 1 copy and `balancedgroup=group:datacenterA:2`
for N copies.
@ -143,18 +143,18 @@ the cluster.
In several examples above, we have preferred content expressions in this
form:
((balanced(group:N) and not (copies=group:N)) or present
(balanced=group:N and not copies=group:N) or present
In order to rebalance, that needs to be changed to:
balanced(group:N)
balanced=group:N
What could be done is make `balanced()` usually expand to the former,
but when --rebalance is used, it only expands to the latter.
(Might make the fully balanced behavior available as `fullybalanced()` for
(Might make the fully balanced behavior available as `fullybalanced` for
users who want it, then
`balanced() == ((fullybalanced(group:N) and not (copies=group:N)) or present`
usually and when --rebalance is used, `balanced() == fullybalanced(group:N)`
`balanced=group:N == (fullybalanced=group:N and not copies=group:N) or present`
usually and when --rebalance is used, `balanced=group:N == fullybalanced=group:N`)


@ -76,6 +76,12 @@ Most of these options are accepted by all git-annex commands.
Overrides the mincopies setting.
* `--rebalance`
Changes the behavior of the "balanced" preferred content expression
to be the same as "fullybalanced". When that expression is used,
this can cause a lot of work to be done to rebalance repositories.
* `--time-limit=time`
Limits how long a git-annex command runs. The time can be something


@ -262,6 +262,52 @@ elsewhere to allow removing it).
says it wants them. (Or, if annex.expireunused is set, it may just delete
them.)
* `balanced=groupname[:number]`
Makes content be evenly balanced among repositories in the group.
The number is the number of repositories in the group that will
want each file. When not specified, the default is 1.
For this to work, each repository in the group should have its preferred
content set to the same expression. Using `groupwanted` is a good
way to do that.
For example, "balanced=backup:2", when there are 3 members of the backup
group, will make each backup repository want 2/3rds of the files.
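
The selection behind that example can be sketched as follows. This is an illustrative Python sketch, not the git-annex implementation. It assumes the group members are put in a stable sorted order, that N is a hash of the key, and that the repositories at positions `(N+I) mod M` for I in `[0..number-1]` want the key, as the design notes describe. The function name and repository names are hypothetical.

```python
import hashlib

def wanting_repos(key: str, group: list[str], number: int = 1) -> list[str]:
    """Return the group members that want the key under balanced=group:number."""
    members = sorted(group)              # assumed: stable order shared by all repos
    m = len(members)
    n = int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest(), "big")
    # Take `number` consecutive positions starting at N mod M, wrapping around.
    return [members[(n + i) % m] for i in range(min(number, m))]
```

With three hypothetical members `backup1`, `backup2`, `backup3` and `number=2`, every key is wanted by exactly two of them, so over many keys each repository ends up wanting roughly 2/3rds of the files.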
The sizes of files are not taken into account, so it's possible for
one repository to get larger than usual files and so fill up before
the other repositories. But files are only wanted by repositories that
have enough free space to hold them. So once a repository is full,
the remaining repositories will have any additional files balanced
among them. In order for this to work, you must use
[[git-annex-size]](1) to specify the size of each repository in the
group.
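
One way to picture the free space rule is to drop full repositories from the candidate list before picking. The Python sketch below is only an assumption about how such a filter could work, not the actual implementation. The function name is hypothetical, and `free_space` is a hypothetical mapping from repository name to free bytes, standing in for the sizes recorded with [[git-annex-size]](1).

```python
import hashlib
from typing import Optional

def wanting_repo_with_space(key: str, key_size: int,
                            free_space: dict[str, int]) -> Optional[str]:
    """Pick the one repository that wants the key, skipping repositories
    that lack room for it. Returns None when no repository has room."""
    candidates = sorted(r for r, free in free_space.items() if free >= key_size)
    if not candidates:
        return None
    n = int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest(), "big")
    # Balance only among the repositories that still have room.
    return candidates[n % len(candidates)]
```

Once one repository fills up, the modulus shrinks to the number of remaining candidates, so new files are spread among those.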
This usually avoids moving files between repositories of the group, even
if that means that things are not optimally balanced. Some of the ways
that it can get out of balance include adding a new repository to the
group, or a file getting copied into more repositories in the group than
the specified number. Running git-annex commands with the `--rebalance`
option will make this expression instead behave like the `fullybalanced`
expression, which will make repositories want to move files around as
necessary in order to get fully balanced.
Note that `not balanced` is a bad thing to put in a preferred content
expression for the same reason `not present` is.
* `fullybalanced=groupname`
This is like `balanced`, but allows moving content between repositories
in the group at any time to keep it fully balanced.
Normally "balanced=groupname:number" is the same as
"(fullybalanced=groupname:number and not copies=groupname:number) or present".
When the `--rebalance` option is used, `balanced` is the same as
`fullybalanced`.
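
The relationship between the two expressions can be written out as a small truth function. This Python sketch is illustrative only; the function and parameter names are hypothetical. `fully_balanced_pick` stands for the result of `fullybalanced` for this repository, and `copies_in_group` for the number of copies currently held by group members.

```python
def balanced_wants(fully_balanced_pick: bool, copies_in_group: int,
                   number: int, present: bool,
                   rebalance: bool = False) -> bool:
    """Evaluate balanced=groupname:number for one repository and one file."""
    if rebalance:
        # With --rebalance, balanced behaves exactly like fullybalanced.
        return fully_balanced_pick
    # copies=groupname:number matches once at least `number` copies exist.
    enough_copies = copies_in_group >= number
    return (fully_balanced_pick and not enough_copies) or present
```

For instance, a repository that already holds a file keeps wanting it (`present` is true) even when it is no longer the balanced pick, unless `rebalance` is set.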
* `anything`
Always matches.
@ -304,6 +350,8 @@ for example `"exclude=* and copies=1"` will be displayed as
[[git-annex-wanted]](1)
[[git-annex-size]](1)
<https://git-annex.branchable.com/preferred_content/>
<https://git-annex.branchable.com/preferred_content/standard_groups/>


@ -58,6 +58,7 @@ it assumes all files that are currently present are preferred content.
Here are recent changes to preferred content expressions, and the version
they were added in.
* "balanced=", "fullybalanced=" 10.20240831
* "securehash" 6.20170228
* "nothing" 6.201600202
* "anything" 5.20150616


@ -30,9 +30,17 @@ Planned schedule of work:
## work notes
* onward to balanced preferred content! But it depends on
[[track_free_space_in_repos_via_git-annex_branch]] so that will be the
first task.
* balanced= and fullybalanced= need to limit the set of repositories to
ones with enough free space to contain a key.
* Add `git-annex size` command.
* Implement [[track_free_space_in_repos_via_git-annex_branch]]
## completed items for August's work on balanced preferred content
* Balanced preferred content basic implementation, including --rebalance
option.
## completed items for August's work on git-annex proxy support for exporttre