improve docs

2024-06-25 17:50:22 -04:00 · e3dd29409b
commit e3dd29409b
parent 0a1001dbfb
2 changed files with 31 additions and 17 deletions
--- a/doc/clusters.mdwn
+++ b/doc/clusters.mdwn
@@ -1,20 +1,22 @@
-A git-annex repository can provide access to its remotes as nodes of a
-cluster. This allows other repositories to access the cluster as a single
-logical repository.
+A cluster is a collection of git-annex repositories which are combined to
+form a single logical repository.
+
+A cluster is accessed via a gateway repository. The gateway is not itself
+a node of the cluster.

 [[!toc ]]

 ## using a cluster

-To use a cluster, your repository needs to have a remote that serves the
-cluster. Clusters can currently only be accessed via ssh. This remote
-is added the same as any other remote:
+To use a cluster, your repository needs to have its gateway configured as a
+remote. Clusters can currently only be accessed via ssh. This gateway
+remote is added the same as any other remote:

    git remote add bigserver me@bigserver:annex

-The remote publishes information about the cluster that it serves
-to the git-annex branch. (See below for how that is configured.) So you may
-need to fetch from it to learn about the cluster that it serves:
+The gateway publishes information about the cluster to the git-annex
+branch. (See below for how that is configured.) So you may need to fetch
+from it to learn about the cluster:

    git fetch bigserver

@@ -34,7 +36,8 @@ they are stored to:
    $ git-annex move bar --to bigserver-mycluster
    move bar (to bigserver-mycluster...) ok

-In fact, a single upload can be sent to every node of the cluster at once. 
+In fact, a single upload like that can be sent to every node of the cluster
+at once, very efficiently.
    
    $ git-annex whereis bar
 	whereis bar (3 copies)
@@ -50,10 +53,13 @@ so the 3 copies are the copies on individual nodes.
 Most other git-annex commands that operate on repositories can also operate on
 clusters.

+A cluster is not a git repository, and so `git pull bigserver-mycluster`
+will not work.
+
 ## configuring a cluster

 A new cluster first needs to be initialized. Run [[git-annex-initcluster]] in
-the repository that will serve the cluster to clients. In the example above,
+the repository that will serve as the cluster's gateway. In the example above,
 this was the "bigserver" repository.

 	$ git-annex initcluster mycluster
@@ -107,3 +113,10 @@ For example:
 By default, when a file is uploaded to a cluster, it is stored on every node of
 the cluster. To control which nodes to store to, the [[preferred_content]] of
 each node can be configured.
+
+It's also a good idea to configure the preferred content of the cluster's
+gateway. To avoid files redundantly being stored on the gateway
+(which remember, is not a node of the cluster), you might make it not want
+any files:
+
+    $ git-annex wanted bigserver nothing
--- a/doc/todo/git-annex_proxies.mdwn
+++ b/doc/todo/git-annex_proxies.mdwn
@@ -26,17 +26,18 @@ In development on the `proxy` branch.

 For June's work on [[design/passthrough_proxy]], remaining todos:

-* Getting a key from a cluster currently always selects the lowest cost
-  remote, and always the same remote if cost is the same. Should
-  round-robin amoung remotes, and prefer to avoid using remotes that
-  other git-annex processes are currently using.
-
-* Basic proxying to special remote support (non-streaming).
+* Since proxying to special remotes is not supported yet, and won't be for
+  the first release, make it fail in a reasonable way.

 * Support distributed clusters: Make a proxy for a cluster repeat
  protocol messages on to any remotes that have the same UUID as
  the cluster. Needs VIA extension to P2P protocol to avoid cycles.

+* Getting a key from a cluster currently always selects the lowest cost
+  remote, and always the same remote if cost is the same. Should
+  round-robin amoung remotes, and prefer to avoid using remotes that
+  other git-annex processes are currently using.
+
 * Optimise proxy speed. See design for ideas.

 * Use `sendfile()` to avoid data copying overhead when