A cluster is a collection of git-annex repositories which are combined to
form a single logical repository.

A cluster is accessed via a gateway repository. The gateway is not itself
a node of the cluster.

[[!toc ]]

## using a cluster

To use a cluster, your repository needs to have its gateway configured as a
remote. Clusters can currently only be accessed via ssh. This gateway
remote is added the same as any other remote:

    git remote add bigserver me@bigserver:annex

The gateway publishes information about the cluster to the git-annex
branch. (See below for how that is configured.) So you may need to fetch
from it to learn about the cluster:

    git fetch bigserver

That makes an additional remote available for the cluster, e.g.
"bigserver-mycluster", as well as a remote for each node, e.g.
"bigserver-node1", "bigserver-node2", etc.

You can get files from the cluster without caring which node they come
from:

    $ git-annex get foo --from bigserver-mycluster
    get foo (from bigserver-mycluster...) ok

And you can send files to the cluster, without caring which nodes
they are stored on:

    $ git-annex move bar --to bigserver-mycluster
    move bar (to bigserver-mycluster...) ok

In fact, a single upload like that can be sent to every node of the cluster
at once, very efficiently.
    
    $ git-annex whereis bar
    whereis bar (3 copies)
        acae2ff6-6c1e-8bec-b8b9-397a3755f397 -- my cluster [bigserver-mycluster]
        9f514001-6dc0-4d83-9af3-c64c96626892 -- node 1 [bigserver-node1]
        d81e0b28-612e-4d73-a4e6-6dabbb03aba1 -- node 2 [bigserver-node2]
        5657baca-2f11-11ef-ae1a-5b68c6321dd9 -- node 3 [bigserver-node3]

Notice that the file is shown as present in the cluster, as well as on
individual nodes. But the cluster itself does not count as a copy of the file,
so the 3 copies are the copies on individual nodes.

Most other git-annex commands that operate on repositories can also operate on
clusters.

A cluster is not a git repository, and so `git pull bigserver-mycluster`
will not work.
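
In practice, git data is fetched from the gateway, while annexed file
content is requested from the cluster remote. The following is a minimal
sketch using the remote names from the examples above. The function name
`fetch_from_cluster` is hypothetical and only groups the two steps; it is
not a git-annex command.

```shell
# Hypothetical wrapper (not part of git-annex).
# Usage: fetch_from_cluster GATEWAY CLUSTER [FILE...]
fetch_from_cluster() {
    gateway=$1      # a real git remote, e.g. bigserver
    cluster=$2      # the cluster remote, e.g. bigserver-mycluster
    shift 2
    # Git branches, including the git-annex branch, come from the gateway.
    git fetch "$gateway" &&
    # File content comes from whichever cluster node has it.
    git-annex get "$@" --from "$cluster"
}

# Example:
#   fetch_from_cluster bigserver bigserver-mycluster foo
```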

## configuring a cluster

A new cluster first needs to be initialized. Run [[git-annex-initcluster]] in
the repository that will serve as the cluster's gateway. In the example above,
this was the "bigserver" repository.

	$ git-annex initcluster mycluster

Once a cluster is initialized, the next step is to add nodes to it.
To make a remote a node of the cluster, set the git config
`remote.name.annex-cluster-node` for that remote to the
name of the cluster.

In the example above, the three cluster nodes were configured like this:

	$ git remote add node1 /media/disk1/repo
	$ git remote add node2 /media/disk2/repo
	$ git remote add node3 /media/disk3/repo
	$ git config remote.node1.annex-cluster-node mycluster
	$ git config remote.node2.annex-cluster-node mycluster
	$ git config remote.node3.annex-cluster-node mycluster
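
With more than a few nodes, the same setting can be applied in a loop.
This is only a sketch: `mark_cluster_nodes` is a hypothetical helper name,
not a git-annex command, and it assumes the node remotes have already been
added to the gateway repository.

```shell
# Hypothetical helper: set annex-cluster-node to the cluster name
# for each remote given on the command line.
# Usage: mark_cluster_nodes CLUSTER REMOTE...
mark_cluster_nodes() {
    cluster=$1
    shift
    for remote in "$@"; do
        git config "remote.$remote.annex-cluster-node" "$cluster" || return 1
    done
}

# Example, run inside the gateway repository:
#   mark_cluster_nodes mycluster node1 node2 node3
#   git config --get-regexp 'annex-cluster-node$'
```

The setting is local to the gateway's git config, so other repositories
only learn about the nodes once `git-annex updatecluster` has been run.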

Finally, run `git-annex updatecluster` to record the cluster configuration
in the git-annex branch. That tells other repositories about the cluster.
	
	$ git-annex updatecluster mycluster
	Added node node1 to cluster: mycluster
	Added node node2 to cluster: mycluster
	Added node node3 to cluster: mycluster
	Started proxying for node1
	Started proxying for node2
	Started proxying for node3

Operations that affect multiple nodes of a cluster can often be sped up by
configuring `annex.jobs` in the repository that will serve the cluster to
clients. In the example above, the nodes are all disk-bound, so operating
on more than one at a time will likely be faster.

    $ git config annex.jobs cpus

## preferred content of clusters

The preferred content of the cluster can be configured. This tells
users what files the cluster as a whole should contain.

To configure the preferred content of a cluster, as well as other related
things like [[groups|git-annex-group]] and [[required_content]], it's easiest
to do the configuration in a repository that has the cluster as a remote.

For example:

	$ git-annex wanted bigserver-mycluster standard
	$ git-annex group bigserver-mycluster archive

By default, when a file is uploaded to a cluster, it is stored on every node of
the cluster. To control which nodes to store to, the [[preferred_content]] of
each node can be configured.
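
As a rough illustration, the nodes could be split by file type. The
`*.mp4` pattern and the function name `configure_node_content` are
assumptions made up for this sketch; the node remote names are the ones
from the examples above, as seen from a repository that has the gateway
as a remote.

```shell
# Hypothetical example: node1 wants only videos,
# the other two nodes want everything else.
configure_node_content() {
    git-annex wanted bigserver-node1 "include=*.mp4" &&
    git-annex wanted bigserver-node2 "exclude=*.mp4" &&
    git-annex wanted bigserver-node3 "exclude=*.mp4"
}

# Example:
#   configure_node_content
```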

It's also a good idea to configure the preferred content of the cluster's
gateway. To avoid files being stored redundantly on the gateway
(which, remember, is not a node of the cluster), you might make it not want
any files:

    $ git-annex wanted bigserver nothing