[[!toc ]]

## motivations

Say we have 2 drives and want to fill them both evenly with files,
different files in each drive. Currently, preferred content cannot express
that entirely:

* One way is to use `a-m*` and `n-z*`, but that's unlikely to split filenames evenly.
* Or, can let both repos take whatever files, perhaps at random, that the
  other repo is not known to contain, but then repos will race and both get
  the same file, or similarly if they are not communicating frequently.
  Existing preferred content expressions such as the one for archive group
  have this problem.

So, let's add a new expression: `balanced=group`

## how it works

To decide which repository wants key K:

A is the list of UUIDs of all the repositories in the group,
in ascending order.

B is A filtered to repositories that have enough free space to store key K.
(Needs [[todo/track_free_space_in_repos_via_git-annex_branch]]
to be implemented.)

S is the concatenation of each UUID in A.

N is the HMAC-SHA256 of K and S, with S being the "secret key" and K being
the message. N is treated as an integer.

M is the number of repositories in B.

Then `N mod M` is the index of the repository in B that wants key K.

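The steps above can be sketched in a few lines of Python. This is only an illustration, not git-annex code: the function name `balanced_pick` and the `has_space` callback are made up, and it assumes that UUIDs are concatenated with no separator, that strings are UTF-8 encoded, and that the HMAC output is read as a big-endian integer.

```python
import hashlib
import hmac


def balanced_pick(key, group_uuids, has_space):
    """Return the UUID of the repository that wants `key`, or None.

    `has_space` is a hypothetical callback standing in for the free space
    tracking that the design says still needs to be implemented.
    """
    a = sorted(group_uuids)                 # A: all group members, ascending
    b = [u for u in a if has_space(u)]      # B: members with room for the key
    if not b:
        return None                         # no repository can store the key
    s = "".join(a).encode()                 # S: concatenation of each UUID in A
    digest = hmac.new(s, key.encode(), hashlib.sha256).digest()
    n = int.from_bytes(digest, "big")       # N: assumed big-endian integer
    return b[n % len(b)]                    # index N mod M into B


# Made-up UUIDs and key name, for illustration only.
group = ["uuid-2", "uuid-1"]
print(balanced_pick("SHA256E-s1048576--example.dat", group, lambda u: True))
```

Every repository that evaluates this with the same view of the group and of free space gets the same answer, so no communication is needed to agree on who wants the key.
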
The purpose of using HMAC-SHA256 here is mostly to evenly distribute keys
among the repositories, since git-annex keys can be pretty long and do not
always contain hashes. Also, including the concatenation of all the UUIDs
of repositories in the group makes it harder to generate a combination of
key and repository UUID that makes that repository want to contain the key.

## stability

Note that this preferred content expression will not be stable. A change in
the members of the group will change which repository is selected. And
changes in how full repositories are will also change which repo is
selected.

Without stability, when another repo is added to the group, or a repository
becomes full, all data will be rebalanced. That could be desirable in some
situations, but the problem is that it's likely that adding repo3 will make
repo1 and repo2 want to swap some files between them.

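A small simulation can make that churn concrete. This is a sketch under assumptions: the repository names and key names are made up, every repo is assumed to have free space, and the hash is read as a big-endian integer. Since adding a member changes S, every key is effectively re-hashed, so one would expect roughly two thirds of keys to change repo and about a third to move between the two old repos.

```python
import hashlib
import hmac


def pick(key, uuids):
    """Pick a repo for `key`, assuming every repo has enough free space."""
    a = sorted(uuids)
    s = "".join(a).encode()
    n = int.from_bytes(hmac.new(s, key.encode(), hashlib.sha256).digest(), "big")
    return a[n % len(a)]


def churn(keys, before, after):
    """Count keys whose pick changes, and how many of those land on a repo
    that was already in the group (ie, swaps between old members)."""
    moved = swapped = 0
    for k in keys:
        old, new = pick(k, before), pick(k, after)
        if old != new:
            moved += 1
            if new in before:
                swapped += 1
    return moved, swapped


keys = [f"KEY-{i}" for i in range(1000)]    # made-up key names
moved, swapped = churn(keys, ["repo1", "repo2"], ["repo1", "repo2", "repo3"])
print(f"{moved} of {len(keys)} keys change repo; "
      f"{swapped} of those move between repo1 and repo2")
```
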
So, we'll want to add some precautions to avoid a lot of data moving around
in such a case:

    (balanced=backup and not (copies=backup:1)) or present

So once a file lands on a backup drive, it stays there, even if more backup
drives change the balancing.

## use case: 3 of 5

What if we have 5 backup repos and want each key to be stored in 3 of them?
There's a simple change that can support that:
`balanced=group:3`

This works the same as before, but rather than just `N mod M`, take
`(N+I) mod M` for each I in [0..2] to get the list of 3 repositories that
want a key.

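A sketch of that variant, again in Python and again with made-up names (`balanced_picks`, `has_space`). It assumes the same encoding choices as a single pick would, and it caps the number of picks at M so that the same repository is never listed twice when fewer than 3 repositories have space.

```python
import hashlib
import hmac


def balanced_picks(key, group_uuids, copies, has_space=lambda u: True):
    """Return the repositories that want `key` under `balanced=group:copies`."""
    a = sorted(group_uuids)
    b = [u for u in a if has_space(u)]
    if not b:
        return []
    s = "".join(a).encode()
    n = int.from_bytes(hmac.new(s, key.encode(), hashlib.sha256).digest(), "big")
    m = len(b)
    # Take consecutive indices (N+I) mod M; never more picks than repos in B.
    return [b[(n + i) % m] for i in range(min(copies, m))]


repos = [f"backup{i}" for i in range(1, 6)]     # 5 hypothetical backup repos
print(balanced_picks("SOMEKEY", repos, 3))

# If 3 of the 5 are full, only 2 repositories are left to want the key.
full = {"backup1", "backup2", "backup3"}
print(balanced_picks("SOMEKEY", repos, 3, has_space=lambda u: u not in full))
```
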
However, once 3 of those 5 repos get full, new keys will only be able to be
stored on 2 of them. At that point one or more new repos will need to be
added to reach the goal of each key being stored in 3 of them. It would be
possible to rebalance the 3 full repos by moving some keys from them to the
other 2 repos, and eke out more storage before needing to add new
repositories. A separate rebalancing pass, that does not use preferred
content alone, could be implemented to handle this (see below).

## use case: geographically distinct datacenters

Of course this is not limited to backup drives. A more complicated example:
There are 4 geographically distributed datacenters, each of which has some
number of drives. Each file should have 1 copy stored in each datacenter,
on some drive there.

This can be implemented by making a group for each datacenter, which all of
its drives are in, and using `balanced` to pick the drive that holds the
copy of the file. The preferred content expression would be, eg:

    (balanced=datacenterA and not copies=datacenterA:1) or present

In such a situation, to avoid a `N^2` remote interconnect, there might be a
transfer repository in each datacenter, that is in front of its drives. The
transfer repository should want files that have not yet reached the
destination drive. How to write a preferred content expression for that?
It might be sufficient to use `copies=datacenterA:1`, so long as the file
reaching any drive in the datacenter is enough. But we may want to add
something analogous to `inallgroup=` that checks if a file is in
the place that `balanced` picks for a group. Eg,
`balancedgroup=datacenterA` for 1 copy and `balancedgroup=group:datacenterA:2`
for N copies.

The [[design/passthrough_proxy]] idea is an alternate way to put a
repository in front of such a cluster, that does not need additional
extensions to preferred content.

## split brain situations

Of course, in the time after the git-annex branch was updated and before
it reaches the local repo, a repo can be full without us knowing about
it. Stores to it would fail, and perhaps be retried, until the updated
git-annex branch was synced.

In the worst case, a split brain situation can make the balanced preferred
content expression pick a different repository to hold two independent
stores of the same key. Eg, when one side thinks one repo is full,
and the other side thinks the other repo is full.

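The simplest instance of that worst case has a group of two repositories, where each side believes a different one is full. A sketch, with made-up repository names and the same assumed encoding choices as the selection scheme needs (UTF-8 strings, big-endian integer):

```python
import hashlib
import hmac


def pick(key, group, believed_full):
    """Pick the repo for `key`, given one side's idea of which repos are full."""
    a = sorted(group)
    b = [u for u in a if u not in believed_full]
    if not b:
        return None
    s = "".join(a).encode()
    n = int.from_bytes(hmac.new(s, key.encode(), hashlib.sha256).digest(), "big")
    return b[n % len(b)]


group = ["repoA", "repoB"]                          # hypothetical UUIDs
side1 = pick("SOMEKEY", group, believed_full={"repoA"})
side2 = pick("SOMEKEY", group, believed_full={"repoB"})
print(side1, side2)                                 # the two sides disagree
```

Each side is left with a single candidate in B, so the hash does not matter here: one side stores the key on repoB and the other on repoA.
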
If `present` is used in the preferred content, both of them will then
want to contain it. (Is `present` really needed like shown in the examples
above?)

If it's not, one of them will drop it and the other will
usually maintain its copy. It would perhaps be possible for both of
them to drop it, leading to a re-upload cycle. This needs some research
to see if it's a real problem.
See [[todo/proving_preferred_content_behavior]].

## rebalancing

In both the 3 of 5 use case and a split brain situation, it's possible for
content to end up not optimally balanced between repositories.

(There are also situations where a cluster node ends up without a copy
of a file that is preferred content, or where adding a copy to a node
would satisfy numcopies. This can happen eg, when a client sends a file
to a single node rather than to the cluster. Rebalancing also will deal
with those.)

git-annex can be made to operate in a mode where it does additional work
to rebalance repositories.

This can be an option like `--rebalance`, that changes how the preferred content
expression is evaluated. The user can choose where and when to run that.
Eg, it might be run on a node inside a cluster after adding more storage to
the cluster.

In several examples above, we have preferred content expressions in this
form:

    (balanced=group:N and not copies=group:N) or present

In order to rebalance, that needs to be changed to:

    balanced=group:N

What could be done is make `balanced` usually expand to the former,
but when `--rebalance` is used, it only expands to the latter.

(Might make the fully balanced behavior available as `fullybalanced` for
users who want it, then
`balanced=group:N == (fullybalanced=group:N and not copies=group:N) or present`
usually and when `--rebalance` is used, `balanced=group:N == fullybalanced=group:N`)

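The two expansions above can be written out as a small truth function, to check how they behave for one repository and one key. This is a sketch of the proposal only, not of any existing git-annex interface: the function `wants` and its parameters are made up, and `copies=group:N` is taken to mean that at least N copies exist in the group.

```python
def wants(fullybalanced, copies_in_group, n, present, rebalance=False):
    """Evaluate the proposed `balanced=group:N` for one repository and key.

    fullybalanced:   the hash-based selection picks this repository
    copies_in_group: how many group members are known to hold the key
    present:         this repository already holds the key
    """
    if rebalance:
        # With --rebalance, only the hash-based selection counts.
        return fullybalanced
    # Usual mode: stop wanting once the group has N copies,
    # but never want to drop a copy that is already here.
    return (fullybalanced and not copies_in_group >= n) or present


# A repo that holds the key but is no longer the balanced pick:
print(wants(False, 1, 1, present=True))                   # keeps it
print(wants(False, 1, 1, present=True, rebalance=True))   # lets it go
```

So in the usual mode data stays where it landed, and only a run with the rebalance mode enabled makes repositories give up keys that the selection now assigns elsewhere.
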