improve docs

Joey Hess 2024-06-25 17:50:22 -04:00
parent 0a1001dbfb
commit e3dd29409b
2 changed files with 31 additions and 17 deletions

View file

@@ -1,20 +1,22 @@
-A git-annex repository can provide access to its remotes as nodes of a
-cluster. This allows other repositories to access the cluster as a single
-logical repository.
+A cluster is a collection of git-annex repositories which are combined to
+form a single logical repository.
+
+A cluster is accessed via a gateway repository. The gateway is not itself
+a node of the cluster.

 [[!toc ]]

 ## using a cluster

-To use a cluster, your repository needs to have a remote that serves the
-cluster. Clusters can currently only be accessed via ssh. This remote
-is added the same as any other remote:
+To use a cluster, your repository needs to have its gateway configured as a
+remote. Clusters can currently only be accessed via ssh. This gateway
+remote is added the same as any other remote:

 git remote add bigserver me@bigserver:annex

-The remote publishes information about the cluster that it serves
-to the git-annex branch. (See below for how that is configured.) So you may
-need to fetch from it to learn about the cluster that it serves:
+The gateway publishes information about the cluster to the git-annex
+branch. (See below for how that is configured.) So you may need to fetch
+from it to learn about the cluster:

 git fetch bigserver
@@ -34,7 +36,8 @@ they are stored to:
 $ git-annex move bar --to bigserver-mycluster
 move bar (to bigserver-mycluster...) ok

-In fact, a single upload can be sent to every node of the cluster at once.
+In fact, a single upload like that can be sent to every node of the cluster
+at once, very efficiently.

 $ git-annex whereis bar
 whereis bar (3 copies)
@@ -50,10 +53,13 @@ so the 3 copies are the copies on individual nodes.
 Most other git-annex commands that operate on repositories can also operate on
 clusters.

+A cluster is not a git repository, and so `git pull bigserver-mycluster`
+will not work.
+
 ## configuring a cluster

 A new cluster first needs to be initialized. Run [[git-annex-initcluster]] in
-the repository that will serve the cluster to clients. In the example above,
+the repository that will serve as the cluster's gateway. In the example above,
 this was the "bigserver" repository.

 $ git-annex initcluster mycluster
@@ -107,3 +113,10 @@ For example:
 By default, when a file is uploaded to a cluster, it is stored on every node of
 the cluster. To control which nodes to store to, the [[preferred_content]] of
 each node can be configured.
+
+It's also a good idea to configure the preferred content of the cluster's
+gateway. To avoid files redundantly being stored on the gateway
+(which remember, is not a node of the cluster), you might make it not want
+any files:
+
+$ git-annex wanted bigserver nothing

View file

@@ -26,17 +26,18 @@ In development on the `proxy` branch.
 For June's work on [[design/passthrough_proxy]], remaining todos:

-* Getting a key from a cluster currently always selects the lowest cost
-remote, and always the same remote if cost is the same. Should
-round-robin amoung remotes, and prefer to avoid using remotes that
-other git-annex processes are currently using.
-
-* Basic proxying to special remote support (non-streaming).
+* Since proxying to special remotes is not supported yet, and won't be for
+the first release, make it fail in a reasonable way.

 * Support distributed clusters: Make a proxy for a cluster repeat
 protocol messages on to any remotes that have the same UUID as
 the cluster. Needs VIA extension to P2P protocol to avoid cycles.

+* Getting a key from a cluster currently always selects the lowest cost
+remote, and always the same remote if cost is the same. Should
+round-robin amoung remotes, and prefer to avoid using remotes that
+other git-annex processes are currently using.
+
 * Optimise proxy speed. See design for ideas.

 * Use `sendfile()` to avoid data copying overhead when