
A git-annex repository can provide access to its remotes as nodes of a
cluster. This allows other repositories to access the cluster as a single
logical repository.

[[!toc ]]

## using a cluster

To use a cluster, your repository needs to have a remote that serves the
cluster. Clusters can currently only be accessed via ssh. This remote
is added the same as any other remote:

    git remote add bigserver me@bigserver:annex

The remote publishes information about the cluster that it serves
to the git-annex branch. (See below for how that is configured.) So you may
need to fetch from it to learn about the cluster that it serves:

    git fetch bigserver

That will make available an additional remote for the cluster, eg
"bigserver-mycluster", as well as some remotes for each node eg
"bigserver-node1", "bigserver-node2", etc.

You can get files from the cluster without caring which node they come
from:

    $ git-annex get foo --from bigserver-mycluster
    get foo (from bigserver-mycluster...) ok

And you can send files to the cluster, without caring what nodes
they are stored to:

    $ git-annex move bar --to bigserver-mycluster
    move bar (to bigserver-mycluster...) ok

In fact, a single upload can be sent to every node of the cluster at once.

    $ git-annex whereis bar
    whereis bar (3 copies)
        acae2ff6-6c1e-8bec-b8b9-397a3755f397 -- my cluster [bigserver-mycluster]
        9f514001-6dc0-4d83-9af3-c64c96626892 -- node 1 [bigserver-node1]
        d81e0b28-612e-4d73-a4e6-6dabbb03aba1 -- node 2 [bigserver-node2]
        5657baca-2f11-11ef-ae1a-5b68c6321dd9 -- node 3 [bigserver-node3]

Notice that the file is shown as present in the cluster, as well as on
individual nodes. But the cluster itself does not count as a copy of the file,
so the 3 copies are the copies on individual nodes.

Most other git-annex commands that operate on repositories can also operate on
clusters.

## configuring a cluster

A new cluster first needs to be initialized. Run [[git-annex-initcluster]] in
the repository that will serve the cluster to clients. In the example above,
this was the "bigserver" repository.

	$ git-annex initcluster mycluster

Once a cluster is initialized, the next step is to add nodes to it.
To make a remote a node of the cluster, set the git config
`remote.<name>.annex-cluster-node` for that remote to the
name of the cluster.

In the example above, the three cluster nodes were configured like this:

	$ git remote add node1 /media/disk1/repo
	$ git remote add node2 /media/disk2/repo
	$ git remote add node3 /media/disk3/repo
	$ git config remote.node1.annex-cluster-node mycluster
	$ git config remote.node2.annex-cluster-node mycluster
	$ git config remote.node3.annex-cluster-node mycluster

Finally, run `git-annex updatecluster` to record the cluster configuration
in the git-annex branch. That tells other repositories about the cluster.

	$ git-annex updatecluster
	Added node node1 to cluster: mycluster
	Added node node2 to cluster: mycluster
	Added node node3 to cluster: mycluster
	Started proxying for node1
	Started proxying for node2
	Started proxying for node3
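
The steps above can be collected into a small shell helper. This is only a
sketch: `setup_cluster_nodes`, `run` and the `DRY_RUN` variable are
hypothetical names, not part of git-annex, and the helper assumes the node
remotes have already been added with `git remote add`.

```shell
#!/bin/sh
# Hypothetical helper, not part of git-annex.
# With DRY_RUN=1, commands are printed instead of being run.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# usage: setup_cluster_nodes clustername node...
# Assumes each node is already a git remote of this repository.
setup_cluster_nodes() {
    cluster=$1
    shift
    # Initialize the cluster in the repository that serves it.
    run git-annex initcluster "$cluster" || return 1
    # Mark each remote as a node of the cluster.
    for node in "$@"; do
        run git config "remote.$node.annex-cluster-node" "$cluster" || return 1
    done
    # Record the cluster configuration in the git-annex branch.
    run git-annex updatecluster
}
```

Setting `DRY_RUN=1` and then running
`setup_cluster_nodes mycluster node1 node2 node3` prints the commands that
would be run, which is a cheap way to check them before touching the
repository.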

## preferred content of clusters

The preferred content of the cluster can be configured. This tells
users what files the cluster as a whole should contain.

To configure the preferred content of a cluster, as well as other related
things like [[groups|git-annex-group]] and [[required_content]], it's easiest
to do the configuration in a repository that has the cluster as a remote.

For example:

	git-annex wanted bigserver-mycluster standard
	git-annex group bigserver-mycluster archive

By default, when a file is uploaded to a cluster, it is stored on every node of
the cluster. To control which nodes to store to, the [[preferred_content]] of
each node can be configured.
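
As an illustration, the nodes could be given different preferred content so
that an upload is only stored on the nodes that want it. This is a sketch
under assumptions: the split by file extension is made up for the example,
`configure_node_wanted` is a hypothetical helper, and the `GIT_ANNEX`
variable exists only so the commands can be previewed with `echo`. It is
meant to be run in a repository that has the cluster as a remote.

```shell
#!/bin/sh
# Hypothetical example: the file patterns are assumptions, not a recommendation.
# Set GIT_ANNEX=echo to print the arguments instead of running git-annex.
GIT_ANNEX=${GIT_ANNEX:-git-annex}

configure_node_wanted() {
    # node1 wants only mp3 files, node2 only flac files,
    # and node3 everything else.
    $GIT_ANNEX wanted bigserver-node1 'include=*.mp3' &&
    $GIT_ANNEX wanted bigserver-node2 'include=*.flac' &&
    $GIT_ANNEX wanted bigserver-node3 'exclude=*.mp3 and exclude=*.flac'
}
```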