2024-06-20 14:57:43 +00:00

A git-annex repository can provide access to its remotes as nodes of a
cluster. This allows other repositories to access the cluster as a single
logical repository.

[[!toc ]]

## using a cluster

gave up on upload fanout to cluster's proxy
The problem with that idea is that the cluster's proxy is necessarily a
remote, and necessarily one that we'll want to sync with, since the git
repository is stored there. So when its preferred content wants a file,
and the cluster does too, the file will get uploaded to it as well as to
the cluster. With fanout, the upload to the cluster will populate the
proxy as well, avoiding a second upload. But only if the file is sent to
the cluster first. If it's sent to the proxy first, there will be two
uploads.
Another, lesser problem is that a repository can proxy for more than one
cluster. So when does it make sense to drop content from the repository?
It could be done when dropping from one cluster, but what of the other
one?
This complication was not necessary anyway. Instead, if it's desirable
to have some content accessed from close to the proxy, one of the
cluster nodes can just be put on the same filesystem as it. That will be
just as fast as storing the content on the proxy.
2024-06-25 17:35:12 +00:00

To use a cluster, your repository needs to have a remote that serves the
cluster. Clusters can currently only be accessed via ssh. This remote
is added the same way as any other remote:

    git remote add bigserver me@bigserver:annex

The remote publishes information about the cluster that it serves
to the git-annex branch. (See below for how that is configured.) So you may
need to fetch from it to learn about the cluster that it serves:

    git fetch bigserver

That will make available an additional remote for the cluster, eg
"bigserver-mycluster", as well as a remote for each node, eg
"bigserver-node1", "bigserver-node2", etc.
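
The naming in that example can be sketched as a small shell helper. This is only an illustration: the function name `cluster_remote_names` is hypothetical, and the `PROXY-NAME` pattern is inferred from the example names above, not taken from a specification.

```shell
# Hypothetical helper: print the remote names that the example above
# suggests will appear after fetching from a proxy remote.
# Usage: cluster_remote_names PROXY CLUSTER NODE...
cluster_remote_names() {
    proxy="$1"
    cluster="$2"
    shift 2
    # The cluster itself is exposed as PROXY-CLUSTER.
    printf '%s-%s\n' "$proxy" "$cluster"
    # Each node is exposed as PROXY-NODE.
    for node in "$@"; do
        printf '%s-%s\n' "$proxy" "$node"
    done
}

cluster_remote_names bigserver mycluster node1 node2
```
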

You can get files from the cluster without caring which node they come
from:

    $ git-annex get foo --from bigserver-mycluster
    get foo (from bigserver-mycluster...) ok

And you can send files to the cluster, without caring which nodes
they are stored on:

    $ git-annex move bar --to bigserver-mycluster
    move bar (to bigserver-mycluster...) ok

In fact, a single upload can be sent to every node of the cluster at once.

    $ git-annex whereis bar
    whereis bar (3 copies)
    acae2ff6-6c1e-8bec-b8b9-397a3755f397 -- my cluster [bigserver-mycluster]
    9f514001-6dc0-4d83-9af3-c64c96626892 -- node 1 [bigserver-node1]
    d81e0b28-612e-4d73-a4e6-6dabbb03aba1 -- node 2 [bigserver-node2]
    5657baca-2f11-11ef-ae1a-5b68c6321dd9 -- node 3 [bigserver-node3]

Notice that the file is shown as present in the cluster, as well as on
individual nodes. But the cluster itself does not count as a copy of the file,
so the 3 copies are the copies on individual nodes.
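
To check that counting rule from a script, a rough sketch like the following could be used. It assumes the `whereis` output has the shape shown above (a UUID, then ` -- `, then a description ending in the remote name in square brackets). The function name `count_node_copies` is hypothetical.

```shell
# Hypothetical helper: read `git-annex whereis` output on stdin and count
# the location lines, skipping the line for the cluster remote named in $1.
count_node_copies() {
    cluster_remote="$1"
    # Keep only lines that start with a UUID followed by " -- ",
    # then count those that do not mention the cluster remote.
    grep -E '^[[:space:]]*[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12} -- ' \
        | grep -v -c -F -- "[$cluster_remote]"
}
```

It would be used as `git-annex whereis bar | count_node_copies bigserver-mycluster`. Matching on the bracketed remote name is a simplification; a description that happens to contain the same bracketed text would be skipped too.
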

Most other git-annex commands that operate on repositories can also operate on
clusters.

## configuring a cluster

A new cluster first needs to be initialized. Run [[git-annex-initcluster]] in
the repository that will serve the cluster to clients. In the example above,
this was the "bigserver" repository.

    $ git-annex initcluster mycluster

Once a cluster is initialized, the next step is to add nodes to it.
To make a remote a node of the cluster, configure
`git config remote.name.annex-cluster-node`, setting it to the
name of the cluster.

In the example above, the three cluster nodes were configured like this:

    $ git remote add node1 /media/disk1/repo
    $ git remote add node2 /media/disk2/repo
    $ git remote add node3 /media/disk3/repo
    $ git config remote.node1.annex-cluster-node mycluster
    $ git config remote.node2.annex-cluster-node mycluster
    $ git config remote.node3.annex-cluster-node mycluster
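
With more than a few nodes, the same steps could be generated by a loop. The sketch below only prints the commands so they can be reviewed before being run. The function name `print_node_setup` and the `nodeN` remote naming are assumptions made for illustration.

```shell
# Hypothetical helper: print the git commands that would register each
# repository path as a node of the named cluster.
# Usage: print_node_setup CLUSTER PATH...
print_node_setup() {
    cluster="$1"
    shift
    n=1
    for path in "$@"; do
        echo "git remote add node$n $path"
        echo "git config remote.node$n.annex-cluster-node $cluster"
        n=$((n + 1))
    done
}

print_node_setup mycluster /media/disk1/repo /media/disk2/repo /media/disk3/repo
```

Once the printed commands look right, the output can be piped to `sh` in the repository that serves the cluster.
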

Finally, run `git-annex updatecluster` to record the cluster configuration
in the git-annex branch. That tells other repositories about the cluster.

    $ git-annex updatecluster
    Added node node1 to cluster: mycluster
    Added node node2 to cluster: mycluster
    Added node node3 to cluster: mycluster
    Started proxying for node1
    Started proxying for node2
    Started proxying for node3

2024-06-25 18:52:47 +00:00

Operations that affect multiple nodes of a cluster can often be sped up by
configuring annex.jobs in the repository that will serve the cluster to
clients. In the example above, the nodes are all disk bound, so operating
on more than one at a time will likely be faster.

    $ git config annex.jobs cpus

## preferred content of clusters

The preferred content of the cluster can be configured. This tells
users what files the cluster as a whole should contain.

To configure the preferred content of a cluster, as well as other related
things like [[groups|git-annex-group]] and [[required_content]], it's easiest
to do the configuration in a repository that has the cluster as a remote.

For example:

    $ git-annex wanted bigserver-mycluster standard
    $ git-annex group bigserver-mycluster archive

By default, when a file is uploaded to a cluster, it is stored on every node of
the cluster. To control which nodes to store to, the [[preferred_content]] of
each node can be configured.
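
As a rough illustration, per-node settings might be generated like this. The split of `*.mp3` files between two nodes is an invented example, and the helper name `node_wanted_command` is hypothetical. The sketch prints `git-annex wanted` commands for review instead of running them.

```shell
# Hypothetical helper: print a `git-annex wanted` command for one node.
# Usage: node_wanted_command REMOTE EXPRESSION
node_wanted_command() {
    printf "git-annex wanted %s '%s'\n" "$1" "$2"
}

# Assumed policy, for illustration only: node1 stores mp3 files,
# node2 stores everything else.
node_wanted_command bigserver-node1 'include=*.mp3'
node_wanted_command bigserver-node2 'exclude=*.mp3'
```
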