[[!toc ]]
## motivations
Say we have 2 drives and want to fill them both evenly with files,
different files in each drive. Currently, preferred content cannot express
that entirely:

* One way is to use `a-m*` and `n-z*`, but that's unlikely to split filenames evenly.
* Or, we can let both repos take whatever files, perhaps at random, that the
  other repo is not known to contain, but then repos will race and both get
  the same file, or similarly if they are not communicating frequently.

Existing preferred content expressions, such as the one for the archive group,
have this problem.
So, let's add a new expression: `balanced=group`
## implementation
This would work by taking the list of uuids of all repositories in the
group that have enough free space to store a key, and sorting them,
which yields a list of M repositories, numbered 0..M-1.
(To know if a repo has enough free space to store a key
will need [[todo/track_free_space_in_repos_via_git-annex_branch]]
to be implemented.)
To decide which repository wants key K, convert K to a number N in some
stable way and then `N mod M` yields the number of the repository that
wants it, while all the rest don't.
(Since git-annex keys can be pretty long and not all of them are random
hashes, let's md5sum the key and then use the md5 as a number.)
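
A minimal sketch of that selection, in Python rather than git-annex's own code. The function names are made up for illustration, and reading the md5 digest as a big-endian integer is an assumption; the design only says to use the md5 as a number.

```python
import hashlib
from typing import Sequence


def key_to_number(key: str) -> int:
    """Turn a key into a stable number N by reading its md5 digest as an integer.

    Assumption: the digest is interpreted big-endian.
    """
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def balanced_pick(key: str, candidate_uuids: Sequence[str]) -> str:
    """Return the uuid of the one repository that wants the key.

    candidate_uuids are the group members believed to have enough free
    space for the key; how that is known is outside this sketch.
    """
    repos = sorted(candidate_uuids)  # sorted list, numbered 0..M-1
    if not repos:
        raise ValueError("no repository in the group has room for the key")
    return repos[key_to_number(key) % len(repos)]
```

Because the list is sorted before indexing, every repository that sees the same group members and the same free space information picks the same repository for a key.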
## stability
Note that this preferred content expression will not be stable. A change in
the members of the group will change which repository is selected. And
changes in how full repositories are will also change which repo is
selected.
Without stability, when another repo is added to the group, all data will
be rebalanced, with some moving to it. Which could be desirable in some
situations, but the problem is that it's likely that adding repo3 will make
repo1 and repo2 want to swap some files between them.
So, we'll want to add some precautions to avoid a lot of data moving around
in such a case:

    (balanced=backup and not (copies=backup:1)) or present

So once a file lands on a backup drive, it stays there, even if more backup
drives change the balancing.
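
To see why the guard helps, here is a rough Python model of the expression above. It is only a sketch: `pick` and `wants` are hypothetical names, free space is ignored, and the md5-to-number conversion is an assumption.

```python
import hashlib


def pick(key, uuids):
    """The single repository that balanced=group selects for key."""
    repos = sorted(uuids)
    n = int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")
    return repos[n % len(repos)]


def wants(repo, key, group, holders):
    """Evaluate (balanced=group and not copies=group:1) or present for repo.

    holders is the set of repositories known to contain the key.
    """
    present = repo in holders
    balanced = pick(key, group) == repo
    has_group_copy = any(h in group for h in holders)  # copies=group:1
    return (balanced and not has_group_copy) or present


keys = ["key%d" % i for i in range(1000)]
before = ["repo1", "repo2"]
after = ["repo1", "repo2", "repo3"]

# Keys whose plain balanced pick changes when repo3 joins the group.
moved = [k for k in keys if pick(k, before) != pick(k, after)]

# With the guarded expression, a key already stored on its first pick
# is wanted only by that repository, whatever the new pick is.
stable = all(
    [r for r in after if wants(r, k, after, {pick(k, before)})] == [pick(k, before)]
    for k in keys
)
```

In this model `moved` is non-empty, so plain `balanced` would shuffle data when repo3 is added, while `stable` is true: existing copies stay where they are.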
## use case: 3 of 5
What if we have 5 backup repos and want each key to be stored in 3 of them?
There's a simple change that can support that:
`balanced=group:3`
This works the same as before, but rather than just `N mod M`, take
`(N+I) mod M` for each I in [0..2] to get the list of 3 repositories that want a
key.
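
A small Python sketch of the `(N+I) mod M` selection. The names are hypothetical, and capping the number of picks at M when fewer than 3 repositories have room is an assumption, not something the design states.

```python
import hashlib


def key_to_number(key):
    """Stable number for a key, taken from its md5 digest (assumed big-endian)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def balanced_picks(key, uuids, copies):
    """Return the repositories that want the key under balanced=group:copies."""
    repos = sorted(uuids)  # numbered 0..M-1
    m = len(repos)
    n = key_to_number(key)
    # Assumption: with fewer than `copies` candidates, each candidate is
    # picked once rather than being listed twice.
    return [repos[(n + i) % m] for i in range(min(copies, m))]


backups = ["backup1", "backup2", "backup3", "backup4", "backup5"]
picks = balanced_picks("example-key", backups, 3)
```

The picks are always 3 consecutive entries of the sorted list, wrapping around at the end, so no repository is chosen twice for the same key.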
However, once 3 of those 5 repos get full, new keys will only be able to be
stored on 2 of them. At that point one or more new repos will need to be
added to reach the goal of each key being stored in 3 of them. It would be
possible to rebalance the 3 full repos by moving some keys from them to the
other 2 repos, and eke out more storage before needing to add new
repositories. A separate rebalancing pass, that does not use preferred
content alone, could be implemented to handle this (see below).
## use case: geographically distinct datacenters
Of course this is not limited to backup drives. A more complicated example:
There are 4 geographically distributed datacenters, each of which has some
number of drives. Each file should have 1 copy stored in each datacenter,
on some drive there.
This can be implemented by making a group for each datacenter, which all of
its drives are in, and using `balanced` to pick the drive that holds the
copy of the file. The preferred content expression would be, eg:

    (balanced=datacenterA and not copies=datacenterA:1) or present

In such a situation, to avoid an `N^2` remote interconnect, there might be a
transfer repository in each datacenter, that is in front of its drives. The
transfer repository should want files that have not yet reached the
destination drive. How to write a preferred content expression for that?
It might be sufficient to use `copies=datacenterA:1`, so long as the file
reaching any drive in the datacenter is enough. But we may want to add
something analogous to `inallgroup=` that checks if a file is in
the place that `balanced` picks for a group. Eg,
`balancedgroup=datacenterA` for 1 copy and `balancedgroup=group:datacenterA:2`
for N copies.
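
What such a check might compute for the 1 copy case can be sketched in Python. This is purely illustrative: `balancedgroup` is only a proposal here, the function names are invented, and free space is ignored.

```python
import hashlib


def pick(key, uuids):
    """The drive that balanced=group selects for key."""
    repos = sorted(uuids)
    n = int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")
    return repos[n % len(repos)]


def in_balanced_place(key, drives, holders):
    """Hypothetical balancedgroup=group check: is the key on the picked drive?"""
    return pick(key, drives) in holders


def transfer_wants(key, drives, holders):
    """A transfer repository keeps the key until it reaches its destination."""
    return not in_balanced_place(key, drives, holders)


drives = ["driveA1", "driveA2", "driveA3"]
dest = pick("example-key", drives)
elsewhere = [d for d in drives if d != dest][0]
```

The difference from `copies=datacenterA:1` shows up when the key sits on `elsewhere`: the copies check is already satisfied, but `transfer_wants` still holds on to the file until it reaches `dest`.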
The [[design/passthrough_proxy]] idea is an alternate way to put a
repository in front of such a cluster, that does not need additional
extensions to preferred content.
## split brain situations
Of course, in the time after the git-annex branch was updated and before
it reaches the local repo, a repo can be full without us knowing about
it. Stores to it would fail, and perhaps be retried, until the updated
git-annex branch was synced.
In the worst case, a split brain situation
can make the balanced preferred content expression
pick a different repository to hold two independent
stores of the same key. Eg, when one side thinks one repo is full,
and the other side thinks the other repo is full.
If `present` is used in the preferred content, both of them will then
want to contain it. (Is `present` really needed like shown in the examples
above?)
If it's not, one of them will drop it and the other will
usually maintain its copy. It would perhaps be possible for both of
them to drop it, leading to a re-upload cycle. This needs some research
to see if it's a real problem.
See [[todo/proving_preferred_content_behavior]].
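
The worst case can be reproduced with a tiny Python model. It is a sketch under assumptions: the names are hypothetical, and each side's belief about which repositories are full is passed in explicitly.

```python
import hashlib


def pick(key, group, believed_full):
    """The repository balanced=group selects, given one side's view of free space."""
    candidates = sorted(r for r in group if r not in believed_full)
    n = int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")
    return candidates[n % len(candidates)]


group = ["repoA", "repoB"]
key = "example-key"

# One side thinks repoA is full, the other side thinks repoB is full.
side1 = pick(key, group, believed_full={"repoA"})
side2 = pick(key, group, believed_full={"repoB"})
```

Here `side1` is repoB and `side2` is repoA, so two independent stores of the same key land on different repositories. With `present` in the expression, both then keep their copy.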
## rebalancing
In both the 3 of 5 use case and a split brain situation, it's possible for
content to end up not optimally balanced between repositories.
(There are also situations where a cluster node ends up without a copy
of a file that is preferred content, or where adding a copy to a node
would satisfy numcopies. This can happen eg, when a client sends a file
to a single node rather than to the cluster. Rebalancing also will deal
with those.)
git-annex can be made to operate in a mode where it does additional work
to rebalance repositories.
This can be an option like --rebalance, that changes how the preferred content
expression is evaluated. The user can choose where and when to run that.
Eg, it might be run on a node inside a cluster after adding more storage to
the cluster.
In several examples above, we have preferred content expressions in this
form:

    (balanced=group:N and not copies=group:N) or present

In order to rebalance, that needs to be changed to:

    balanced=group:N

What could be done is make `balanced()` usually expand to the former,
but when --rebalance is used, it only expands to the latter.
(Might make the fully balanced behavior available as `fullybalanced` for
users who want it, then
`balanced=group:N == (fullybalanced=group:N and not copies=group:N) or present`
usually and when --rebalance is used, `balanced=group:N == fullybalanced=group:N`)
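
The proposed expansion could be modeled like this in Python. It is a sketch only: `expand_balanced` is a hypothetical helper working on expression strings, not how git-annex parses preferred content.

```python
def expand_balanced(group, n, rebalance=False):
    """Expand balanced=group:n as proposed, depending on --rebalance."""
    fully = "fullybalanced=%s:%d" % (group, n)
    if rebalance:
        # Rebalancing: follow the balanced pick strictly, even if that
        # means moving content that is already stored somewhere.
        return fully
    # Usual mode: only want content that lacks enough copies in the
    # group, and keep whatever is already present.
    return "(%s and not copies=%s:%d) or present" % (fully, group, n)


usual = expand_balanced("backup", 3)
rebalancing = expand_balanced("backup", 3, rebalance=True)
```

Keeping the two forms behind one `balanced` term means the same preferred content setting serves both modes, and the user only decides when to pass the option.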