support simulating clusters

This does not actually simulate the cluster implementation at all. Instead, it models only the essential fact that cluster gateways know what changes they have made to each node of a cluster. That is enough for sims like sizebalanced_cluster.
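
A minimal sketch of that idea, not the code from this commit: when a key is sent to a cluster node, the change is recorded for the underlying repository and immediately shown to every repository connected to that cluster node. The `Sim` record, the use of repository names in place of UUIDs, and the helpers `recordPresent`, `knows` and `example` are all assumptions made for illustration; only the field names `simClusterNodes`, `simConnections`, `simLocations` and the function name `setPresentKey` are taken from the design notes removed from doc/todo/git-annex_proxies.mdwn.

```haskell
import qualified Data.Map.Strict as M
import qualified Data.Set as S

-- Hypothetical stand-ins; the real sim tracks UUIDs and much more state.
type RepoName = String
type Key = String

data Sim = Sim
    { simClusterNodes :: M.Map RepoName RepoName
    -- ^ cluster node name -> name of the underlying repository
    , simConnections :: M.Map RepoName (S.Set RepoName)
    -- ^ repository -> names of the remotes it is connected to
    , simLocations :: M.Map RepoName (M.Map RepoName (S.Set Key))
    -- ^ viewer -> (repository -> keys the viewer believes it holds)
    }

-- | Record in one viewer's location view that the target holds the key.
recordPresent :: Key -> RepoName -> RepoName -> Sim -> Sim
recordPresent key target viewer sim = sim
    { simLocations = M.insertWith (M.unionWith S.union) viewer
        (M.singleton target (S.singleton key))
        (simLocations sim)
    }

-- | The sender stores a key on a remote. If the remote is a cluster
-- node, every repository connected to that node learns of it at once.
setPresentKey :: RepoName -> RepoName -> Key -> Sim -> Sim
setPresentKey sender remote key sim =
    foldr (recordPresent key target) sim viewers
  where
    (target, viewers) = case M.lookup remote (simClusterNodes sim) of
        Nothing -> (remote, [remote, sender])
        Just underlying ->
            (underlying, underlying : sender : connectedTo remote)
    connectedTo r =
        [ repo
        | (repo, remotes) <- M.toList (simConnections sim)
        , r `S.member` remotes
        ]

-- | Does the viewer believe that the target holds the key?
knows :: RepoName -> RepoName -> Key -> Sim -> Bool
knows viewer target key sim = maybe False (S.member key)
    (M.lookup viewer (simLocations sim) >>= M.lookup target)

-- | The topology of doc/sims/sizebalanced_cluster.mdwn, without files.
example :: Sim
example = Sim
    { simClusterNodes = M.fromList
        [("cluster-node1", "node1"), ("cluster-node2", "node2")]
    , simConnections = M.fromList
        [ ("foo", S.fromList ["cluster-node1", "cluster-node2"])
        , ("bar", S.fromList ["cluster-node1", "cluster-node2"])
        ]
    , simLocations = M.empty
    }

main :: IO ()
main = do
    let sim = setPresentKey "foo" "cluster-node1" "key1" example
    -- bar never talked to foo, yet sees what foo sent via the cluster.
    print (knows "bar" "node1" "key1" sim)
    print (knows "bar" "node2" "key1" sim)
```

In this sketch bar learns about foo's transfer to node1 without any sync between foo and bar, which is the shared view a cluster gateway provides and what lets size balancing stay correct with concurrent senders.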
2024-09-25 14:06:41 -04:00
commit 8e94b75a61
parent 61c95f4d29
5 changed files with 84 additions and 112 deletions
--- a/doc/git-annex-sim.mdwn
+++ b/doc/git-annex-sim.mdwn
@@ -398,6 +398,8 @@ as passed to "git annex sim" while a simulation is running.
    group node2 cluster
    wanted node1 sizebalanced=cluster
    wanted node2 sizebalanced=cluster
+    maxsize node1 100gb
+    maxsize node2 100gb
    connect cluster-node2 <- foo -> cluster-node1
    connect cluster-node2 <- bar -> cluster-node1
    addmulti 10 foo 1gb 2gb foo 
@@ -405,9 +407,9 @@ as passed to "git annex sim" while a simulation is running.
    action foo sendwanted cluster-node1 while action foo sendwanted cluster-node2 while action bar sendwanted cluster-node1 while action bar sendwanted cluster-node2

  In the above example, while foo and bar are both concurrently sending
-  wanted files to both nodes, each will know immediately which files have
-  been sent by the other, and so the files will be sizebalanced between
-  them optimally.
+  wanted files to both cluster nodes, each will know immediately which
+  files have been sent by the other, and so the files will be sizebalanced
+  between them optimally.

 # OPTIONS

--- a/doc/sims/sizebalanced_cluster.mdwn
+++ b/doc/sims/sizebalanced_cluster.mdwn
@@ -0,0 +1,23 @@
+# Size balanced preferred content sim with multiple repositories sending  
+# concurrently to the same repositories, in a cluster.  
+#   
+# This demonstrates that size balanced preferred content does not get out  
+# of balance when used with cluster nodes.  
+init foo  
+init bar  
+init node1  
+init node2  
+clusternode cluster-node1 node1  
+clusternode cluster-node2 node2  
+group node1 cluster  
+group node2 cluster  
+wanted node1 sizebalanced=cluster  
+wanted node2 sizebalanced=cluster  
+maxsize node1 100gb  
+maxsize node2 100gb  
+connect cluster-node2 <- foo -> cluster-node1  
+connect cluster-node2 <- bar -> cluster-node1  
+addmulti 10 foo 1gb 2gb foo   
+addmulti 10 bar 1gb 2gb bar  
+action foo sendwanted cluster-node1 while action foo sendwanted cluster-node2 while action bar sendwanted cluster-node1 while action bar sendwanted cluster-node2  
+visit foo git-annex maxsize  
--- a/doc/todo/git-annex_proxies.mdwn
+++ b/doc/todo/git-annex_proxies.mdwn
@@ -30,101 +30,6 @@ Planned schedule of work:

 * Currently working in [[todo/proving_preferred_content_behavior]]

-* sim: Can a cluster using size balanced preferred content be simulated?
-  May need the sim to get the concept of a cluster gateway, since the
-  gateway is what picks among the nodes on the basis of size. On the other
-  hand, it may suffice to connect the client repo directly to each node of
-  the cluster, and let that repo pick which nodes to send to.
-
-  The difference between having a cluster gateway and direct connections to
-  the nodes is when there are multiple clients. The cluster gateway updates
-  its location logs to reflect changes in the nodes that get proxied via
-  it. So it will pick a node that is not full when using size balanced
-  preferred content. If two clients are accessing a node directly without a
-  cluster gateway, that doesn't happen.
-
-  So, for a cluster accessed via a single client, direct connections to the
-  nodes are ok for the sim. But for multiple clients, the sim would need to
-  support clusters.
-
-  Would it suffice, if a repo is a node in a cluster, for every change to
-  its location log to be immediately propagated to every other repo in the
-  sim that has a connection to it? That simulates the centralized view that
-  the cluster gateway has, without the complication of actually simulating
-  a cluster gateway.
-
-  That would not allow simulating a cluster node that is
-  also accessed directly via another repository. But cluster nodes
-  generally should not be accessed except via the gateway. Still, to allow
-  simulating that, it would be possible to have a new type of connection,
-  which is via a gateway. Use eg "-g->" for it. Then to simulate a cluster,
-  which foo is accessing via a gateway:
-
-    connect node1 <-g- foo -g-> node2
-    connect node1 <-g- bar -g-> node2
-
-  What that would do is, for every change in foo's location log for node1
-  or node2, immediately propagate it to bar's location log.
-
-  Or an alternative syntax:
-
-    cluster g node1 node2
-    connect g-node1 <- foo -> g-node2
-    connect g-node1 <- bar -> g-node2
-
-  The only thing that does not allow simulating is 2 cluster gateways
-  that each proxy for some of the same nodes. In that situation, there
-  are two views of the contents of the nodes, which is similar to two
-  clients having direct connections to the nodes, but not the same when
-  there are more than 2 clients connected to the 2 gateways. Simulating
-  that would require a first-class gateway simulation with its own location
-  log and node selection.
-
-  Alternative approach: Let a cluster node be initialized, which is an
-  overlay over a repository which shares all of its configuration
-  except for its uuid. Every change to the location log of a cluster
-  node is immediately propagated to every repository that has a connection
-  to it. It is also propagated to the underlying repository. This lets
-  more than one cluster node be initialized for the same repository, for
-  when it is in multiple clusters or behind multiple gateways in the same
-  cluster.
-
-    clusternode mycluster-foo foo
-    clusternode othercluster-foo foo
-
-  Implementation plan for this:
-
-  * clusternode initializes a new cluster node UUID, and adds to
-    simRepos.
-  * add `simClusterNodes :: M.Map UUID (UUID, RemoteName)`,
-    which maps from the cluster node UUID to the UUID of the underlying
-    repo, and its node name.
-  * clusternode also adds to simClusterNodes.
-  * setPresentKey checks if the UUID is in simClusterNodes.
-  * If it is, it makes the key present/missing in the underlying repo
-    UUID as well.
-  * And, it looks through simConnections to find any other repos that
-    also have a connection to the cluster node with that name.
-    Each of those repos also gets its simLocations updated.
-
-  But: The cluster node UUID would need to have the same preferred content
-  etc as the underlying repo. And, it would need to be in the same groups.
-  And it would be counted as another copy. Could use a cluster UUID to
-  avoid the numcopies count. But can adding a separate UUID be avoided?
-
-  Implementation plan for this without separate UUID:
-
-  * add `simClusterNodes :: M.Map RepoName UUID`,
-  * clusternode adds to simClusterNodes.
-  * checkKnownRemote needs to check simClusterNodes as well as
-    simRepos so that cluster nodes can be used as remotes.
-  * Plumb repo name through to setPresentKey.
-  * setPresentKey checks if repo name is in simClusterNodes.
-  * If it is, it looks through simConnections to find any other
-    repos that also have a connection to the cluster node with
-    that name. Each of those repos also gets its simLocations updated
-    for the change being logged.
-
 * sim: Add support for metadata, so preferred content that matches on it
  will work