From e3dd29409b91cdf98066102a4565f35bee572edd Mon Sep 17 00:00:00 2001
From: Joey Hess
Date: Tue, 25 Jun 2024 17:50:22 -0400
Subject: [PATCH] improve docs

---
 doc/clusters.mdwn               | 35 ++++++++++++++++++++++-----------
 doc/todo/git-annex_proxies.mdwn | 13 ++++++------
 2 files changed, 31 insertions(+), 17 deletions(-)

diff --git a/doc/clusters.mdwn b/doc/clusters.mdwn
index bf79adbae8..deb7113f1f 100644
--- a/doc/clusters.mdwn
+++ b/doc/clusters.mdwn
@@ -1,20 +1,22 @@
-A git-annex repository can provide access to its remotes as nodes of a
-cluster. This allows other repositories to access the cluster as a single
-logical repository.
+A cluster is a collection of git-annex repositories which are combined to
+form a single logical repository.
+
+A cluster is accessed via a gateway repository. The gateway is not itself
+a node of the cluster.
 
 [[!toc ]]
 
 ## using a cluster
 
-To use a cluster, your repository needs to have a remote that serves the
-cluster. Clusters can currently only be accessed via ssh. This remote
-is added the same as any other remote:
+To use a cluster, your repository needs to have its gateway configured as a
+remote. Clusters can currently only be accessed via ssh. This gateway
+remote is added the same as any other remote:
 
     git remote add bigserver me@bigserver:annex
 
-The remote publishes information about the cluster that it serves
-to the git-annex branch. (See below for how that is configured.) So you may
-need to fetch from it to learn about the cluster that it serves:
+The gateway publishes information about the cluster to the git-annex
+branch. (See below for how that is configured.) So you may need to fetch
+from it to learn about the cluster:
 
     git fetch bigserver
 
@@ -34,7 +36,8 @@ they are stored to:
     $ git-annex move bar --to bigserver-mycluster
     move bar (to bigserver-mycluster...) ok
 
-In fact, a single upload can be sent to every node of the cluster at once.
+In fact, a single upload like that can be sent to every node of the cluster
+at once, very efficiently.
 
     $ git-annex whereis bar
     whereis bar (3 copies)
@@ -50,10 +53,13 @@ so the 3 copies are the copies on individual nodes.
 Most other git-annex commands that operate on repositories can also operate on
 clusters.
 
+A cluster is not a git repository, and so `git pull bigserver-mycluster`
+will not work.
+
 ## configuring a cluster
 
 A new cluster first needs to be initialized. Run [[git-annex-initcluster]] in
-the repository that will serve the cluster to clients. In the example above,
+the repository that will serve as the cluster's gateway. In the example above,
 this was the "bigserver" repository.
 
     $ git-annex initcluster mycluster
@@ -107,3 +113,10 @@ For example:
 By default, when a file is uploaded to a cluster, it is stored on every node of
 the cluster. To control which nodes to store to, the [[preferred_content]] of
 each node can be configured.
+
+It's also a good idea to configure the preferred content of the cluster's
+gateway. To avoid files redundantly being stored on the gateway
+(which remember, is not a node of the cluster), you might make it not want
+any files:
+
+    $ git-annex wanted bigserver nothing
diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn
index 00a6c37cfd..cb4f668c45 100644
--- a/doc/todo/git-annex_proxies.mdwn
+++ b/doc/todo/git-annex_proxies.mdwn
@@ -26,17 +26,18 @@ In development on the `proxy` branch.
 
 For June's work on [[design/passthrough_proxy]], remaining todos:
 
-* Getting a key from a cluster currently always selects the lowest cost
-  remote, and always the same remote if cost is the same. Should
-  round-robin amoung remotes, and prefer to avoid using remotes that
-  other git-annex processes are currently using.
-
-* Basic proxying to special remote support (non-streaming).
+* Since proxying to special remotes is not supported yet, and won't be for
+  the first release, make it fail in a reasonable way.
 
 * Support distributed clusters: Make a proxy for a cluster repeat
   protocol messages on to any remotes that have the same UUID as
   the cluster. Needs VIA extension to P2P protocol to avoid cycles.
 
+* Getting a key from a cluster currently always selects the lowest cost
+  remote, and always the same remote if cost is the same. Should
+  round-robin amoung remotes, and prefer to avoid using remotes that
+  other git-annex processes are currently using.
+
 * Optimise proxy speed. See design for ideas.
 
 * Use `sendfile()` to avoid data copying overhead when