support simulating clusters
This is done without simulating the cluster implementation at all. Instead, the sim models only the essential fact that cluster gateways know what changes they have made to each node of a cluster. That is enough for sims like sizebalanced_cluster.
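The idea can be sketched roughly as follows. This is a minimal illustration and not the code from this commit: the names `simRepos`, `simClusterNodes`, `simConnections`, `simLocations` and `setPresentKey` echo the implementation notes that this commit removes from the todo file, while the type synonyms, the helper `recordLocation`, the field layouts, and the sample data are all assumptions made for the example.

```haskell
import qualified Data.Map.Strict as M
import qualified Data.Set as S
import Data.Maybe (fromMaybe)

-- Hypothetical stand-ins; the real simulator's types are richer.
type UUID = String
type RepoName = String
type Key = String

data SimState = SimState
    { simRepos :: M.Map RepoName UUID
    -- Ordinary repositories and their UUIDs.
    , simClusterNodes :: M.Map RepoName UUID
    -- Cluster node name -> UUID of the underlying repository.
    , simConnections :: M.Map RepoName (S.Set RepoName)
    -- For each repository, the names of the remotes it is connected to.
    , simLocations :: M.Map RepoName (M.Map Key (S.Set UUID))
    -- Each repository's own view of which UUIDs hold each key.
    }

-- Update one repository's view of whether a UUID holds a key.
recordLocation :: Bool -> UUID -> Key -> RepoName -> SimState -> SimState
recordLocation present u k repo st = st
    { simLocations = M.alter (Just . update . fromMaybe M.empty) repo (simLocations st) }
  where
    update = M.alter (Just . change . fromMaybe S.empty) k
    change = if present then S.insert u else S.delete u

-- The sender always records the change. When the remote is a cluster node,
-- every repository with a connection to that node records it as well,
-- which stands in for the shared knowledge a cluster gateway would have.
setPresentKey :: Bool -> RepoName -> RepoName -> Key -> SimState -> SimState
setPresentKey present sender remote k st =
    case M.lookup remote (simClusterNodes st) of
        Just u -> foldr (recordLocation present u k) st (sender : peers)
        Nothing -> case M.lookup remote (simRepos st) of
            Just u -> recordLocation present u k sender st
            Nothing -> st
  where
    peers =
        [ r
        | (r, remotes) <- M.toList (simConnections st)
        , remote `S.member` remotes
        ]

main :: IO ()
main = do
    let st0 = SimState
            { simRepos = M.fromList
                [("foo", "u-foo"), ("bar", "u-bar"), ("node1", "u-node1")]
            , simClusterNodes = M.fromList [("cluster-node1", "u-node1")]
            , simConnections = M.fromList
                [ ("foo", S.fromList ["cluster-node1"])
                , ("bar", S.fromList ["cluster-node1"])
                ]
            , simLocations = M.empty
            }
        -- foo sends key1 to the cluster node.
        st1 = setPresentKey True "foo" "cluster-node1" "key1" st0
    -- bar never sent anything, yet its view already includes the change.
    print (M.lookup "bar" (simLocations st1))
```

In this sketch, after foo sends `key1` to `cluster-node1`, bar's view also lists `u-node1` as holding `key1`, even though bar took no action. Updating the underlying node repository's own view is left out here for brevity.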
parent 61c95f4d29
commit 8e94b75a61
5 changed files with 84 additions and 112 deletions
@@ -398,6 +398,8 @@ as passed to "git annex sim" while a simulation is running.
 group node2 cluster
 wanted node1 sizebalanced=cluster
 wanted node2 sizebalanced=cluster
+maxsize node1 100gb
+maxsize node2 100gb
 connect cluster-node2 <- foo -> cluster-node1
 connect cluster-node2 <- bar -> cluster-node1
 addmulti 10 foo 1gb 2gb foo
@@ -405,9 +407,9 @@ as passed to "git annex sim" while a simulation is running.
 action foo sendwanted cluster-node1 while action foo sendwanted cluster-node2 while action bar sendwanted cluster-node1 while action bar sendwanted cluster-node2

 In the above example, while foo and bar are both concurrently sending
-wanted files to both nodes, each will know immediately which files have
-been sent by the other, and so the files will be sizebalanced between
-them optimally.
+wanted files to both cluster nodes, each will know immediately which
+files have been sent by the other, and so the files will be sizebalanced
+between them optimally.

 # OPTIONS

23
doc/sims/sizebalanced_cluster.mdwn
Normal file
@@ -0,0 +1,23 @@
+# Size balanced preferred content sim with multiple repositories sending
+# concurrently to the same repositories, in a cluster.
+#
+# This demonstrates that size balanced preferred content does not get out
+# of balance when used with cluster nodes.
+init foo
+init bar
+init node1
+init node2
+clusternode cluster-node1 node1
+clusternode cluster-node2 node2
+group node1 cluster
+group node2 cluster
+wanted node1 sizebalanced=cluster
+wanted node2 sizebalanced=cluster
+maxsize node1 100gb
+maxsize node2 100gb
+connect cluster-node2 <- foo -> cluster-node1
+connect cluster-node2 <- bar -> cluster-node1
+addmulti 10 foo 1gb 2gb foo
+addmulti 10 bar 1gb 2gb bar
+action foo sendwanted cluster-node1 while action foo sendwanted cluster-node2 while action bar sendwanted cluster-node1 while action bar sendwanted cluster-node2
+visit foo git-annex maxsize

@@ -30,101 +30,6 @@ Planned schedule of work:

 * Currently working in [[todo/proving_preferred_content_behavior]]

-* sim: Can a cluster using size balanced preferred content be simulated?
-May need the sim to get the concept of a cluster gateway, since the
-gateway is what picks among the nodes on the basis of size. On the other
-hand, it may suffice to connect the client repo directly to each node of
-the cluster, and let that repo pick which nodes to send to.
-
-The difference between having a cluster gateway and direct connections to
-the nodes is when there are multiple clients. The cluster gateway updates
-its location logs to reflect changes in the nodes that get proxied via
-it. So it will pick a node that is not full when using size balanced
-preferred content. If two clients are accessing a node directly without a
-cluster gateway, that doesn't happen.
-
-So, for a cluster accessed via a single client, direct connections to the
-nodes are ok for the sim. But for multiple clients, the sim would need to
-support clusters.
-
-Would it suffice, if a repo is a node in a cluster, for every change to
-its location log to be immediately propagated to every other repo in the
-sim that has a connection to it? That simulates the centralized view that
-the cluster gateway has, without the complication of actually simulating
-a cluster gateway.
-
-That would not allow simulating a cluster node that is
-also accessed directly via another repository. But cluster nodes
-generally should not be accessed except via the gateway. Still, to allow
-simulating that, it would be possible to have a new type of connection,
-which is via a gateway. Use eg "-g->" for it. Then to simulate a cluster,
-which foo is accessing via a gateway:
-
-connect node1 <-g- foo -g-> node2
-connect node1 <-g- bar -g-> node2
-
-What that would do is, for every change in foo's location log for node1
-or node2, immediately propagate it to bar's location log.
-
-Or an alternative syntax:
-
-cluster g node1 node2
-connect g-node1 <- foo -> g-node2
-connect g-node1 <- bar -> g-node2
-
-The only thing that does not allow simulating is 2 cluster gateways
-that each proxy for some of the same nodes. In that situation, there
-are two views of the contents of the nodes, which is similar to two
-clients having direct connections to the nodes, but not the same when
-there are more than 2 clients connected to the 2 gateways. Simulating
-that would require a first-class gateway simulation with its own location
-log and node selection.
-
-Alternative approach: Let a cluster node be initialized, which is an
-overlay over a repository which shares all of its configuration
-except for its uuid. Every change to the location log of a cluster
-node is immediately propagated to every repository that has a connection
-to it. It is also propagated to the underlying repository. This lets
-more than one cluster node be initialized for the same repository, for
-when it is in multiple clusters or behind multiple gateways in the same
-cluster.
-
-clusternode mycluster-foo foo
-clusternode othercluster-foo foo
-
-Implementation plan for this:
-
-* clusternode initializes a new cluster node UUID, and adds to
-simRepos.
-* add `simClusterNodes :: M.Map UUID (UUID, RemoteName)`,
-which maps from the cluster node UUID to the UUID of the underlying
-repo, and its node name.
-* clusternode also adds to simClusterNodes.
-* setPresentKey checks if the UUID is in simClusterNodes.
-* If it is, it makes the key present/missing in the underlying repo
-UUID as well.
-* And, it looks through simConnections to find any other repos that
-also have a connection to the cluster node with that name.
-Each of those repos also gets its simLocations updated.
-
-But: The cluster node UUID would need to have the same preferred content
-etc as the underlying repo. And, it would need to be in the same groups.
-And it would be counted as another copy. Could use a cluster UUID to
-avoid the numcopies count. But can adding a separate UUID be avoided?
-
-Implementation plan for this without separate UUID:
-
-* add `simClusterNodes :: M.Map RepoName UUID`,
-* clusternode adds to simClusterNodes.
-* checkKnownRemote needs to check simClusterNodes as well as
-simRepos so that cluster nodes can be used as remotes.
-* Plumb repo name through to setPresentKey.
-* setPresentKey checks if repo name is in simClusterNodes.
-* If it is, it looks through simConnections to find any other
-repos that also have a connection to the cluster node with
-that name. Each of those repos also gets its simLocations updated
-for the change being logged.
-
 * sim: Add support for metadata, so preferred content that matches on it
 will work
