more formal documentation of balancing

Joey Hess 2024-08-11 13:29:06 -04:00
parent bd5affa362
commit 3019b21c40
GPG key ID: DB12DB0FF05F8F38
2 changed files with 24 additions and 15 deletions


@@ -15,22 +15,31 @@ that entirely:
 So, let's add a new expression: `balanced=group`

-## implementation
+## how it works

-This would work by taking the list of uuids of all repositories in the
-group that have enough free space to store a key, and sorting them,
-which yields a list from 0..M-1 repositories.
+To decide which repository wants key K:

-(To know if a repo has enough free space to store a key
-will need [[todo/track_free_space_in_repos_via_git-annex_branch]]
+A is the list of UUIDs of all the repositories in the group,
+in ascending order.
+
+B is A filtered to repositories that have enough free space to store key K.
+(Needs [[todo/track_free_space_in_repos_via_git-annex_branch]]
 to be implemented.)

-To decide which repository wants key K, convert K to a number N in some
-stable way and then `N mod M` yields the number of the repository that
-wants it, while all the rest don't.
+S is the concatenation of each UUID in A.

-(Since git-annex keys can be pretty long and not all of them are random
-hashes, let's md5sum the key and then use the md5 as a number.)
+N is the HMAC-SHA256 of K and S, with S being the "secret key" and K being
+the message.
+
+M is the number of repositories in B.
+
+Then `N mod M` is the index of the repository in B that wants key K.
+
+The purpose of using HMAC-SHA256 here is mostly to evenly distribute
+among the repositories, since git-annex keys can be pretty long and do not
+always contain hashes. Also, including the concatenation of all the UUIDs
+of repositories in the group makes it harder to generate a combination of
+key and repository UUID that makes that repository want to contain the key.

 ## stability
@@ -39,8 +48,8 @@ the members of the group will change which repository is selected. And
 changes in how full repositories are will also change which repo is
 selected.

-Without stability, when another repo is added to the group, all data will
-be rebalanced, with some moving to it. Which could be desirable in some
+Without stability, when another repo is added to the group, or a repository
+becomes full, all data will be rebalanced. Which could be desirable in some
 situations, but the problem is that it's likely that adding repo3 will make
 repo1 and repo2 want to swap some files between them,


@@ -42,8 +42,8 @@ Planned schedule of work:
 not occur. Users wanting 2 copies can have 2 groups which are each
 balanced, although that would mean more repositories on more drives.

-* document balancing algo well enough that someone else could implement it
-from the design doc
+Also note that "fullybalanced=foo:2" is not currently actually
+implemented!

 * Add `git-annex maxsize` command.
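
The selection steps described in the "how it works" text of the first hunk can be sketched in a few lines of Python. This is only an illustration, not git-annex's implementation: the function name `balanced_pick` and the `has_space` callback are hypothetical, and several details the design text leaves open are assumptions here (UTF-8 encoding of keys and UUIDs, concatenation without a separator, reading the HMAC digest as a big-endian integer, and returning `None` when no repository has free space).

```python
import hashlib
import hmac


def balanced_pick(key, group_uuids, has_space):
    """Return the UUID of the repository that wants `key`, or None.

    `has_space` is a hypothetical callback standing in for the
    free-space tracking that the design doc says is still a todo.
    """
    # A: UUIDs of all repositories in the group, in ascending order.
    a = sorted(group_uuids)
    # B: A filtered to repositories with enough free space for the key.
    b = [u for u in a if has_space(u)]
    if not b:
        return None  # assumption: no repository wants the key if B is empty
    # S: concatenation of each UUID in A, used as the HMAC "secret key".
    s = "".join(a).encode("utf-8")
    # N: HMAC-SHA256 with S as the secret and K as the message,
    # interpreted as an integer (big-endian is an assumption).
    digest = hmac.new(s, key.encode("utf-8"), hashlib.sha256).digest()
    n = int.from_bytes(digest, "big")
    # M is the number of repositories in B; N mod M is the index into B.
    return b[n % len(b)]
```

Because S is built from A and the index is taken modulo the size of B, the result changes whenever group membership or the set of repositories with free space changes, which is the stability problem the design doc goes on to discuss.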