gave up on upload fanout to cluster's proxy

The problem with that idea is that the cluster's proxy is necessarily a
remote, and necessarily one that we'll want to sync with, since the git
repository is stored there. So when its preferred content wants a file,
and the cluster does too, the file will get uploaded to it as well as to
the cluster. With fanout, the upload to the cluster will populate the
proxy as well, avoiding a second upload. But only if the file is sent to
the cluster first. If it's sent to the proxy first, there will be two
uploads.
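
The order dependence described above can be illustrated with a toy model. This is only a sketch, not git-annex code: the function `count_uploads` and its parameters are hypothetical, and it assumes that both the proxy and the cluster want the file and that the client sends to each destination once, in the given order.

```python
def count_uploads(order, fanout_to_proxy):
    """Count client uploads when both the proxy and the cluster want a file.

    order: destinations in the order the client sends to them,
           each either "proxy" or "cluster" (hypothetical labels).
    fanout_to_proxy: True if an upload to the cluster also populates
                     the proxy repository (the abandoned fanout idea).
    """
    proxy_has = False
    cluster_has = False
    uploads = 0
    for dest in order:
        if dest == "cluster" and not cluster_has:
            uploads += 1
            cluster_has = True
            if fanout_to_proxy:
                # Fanout stores the content on the proxy as a side effect.
                proxy_has = True
        elif dest == "proxy" and not proxy_has:
            uploads += 1
            proxy_has = True
    return uploads


if __name__ == "__main__":
    for order in (["cluster", "proxy"], ["proxy", "cluster"]):
        print(order, "uploads with fanout:", count_uploads(order, True))
```

In this model, fanout saves an upload only when the cluster is sent to first. When the proxy is sent to first, the count is the same as with no fanout at all.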

Another, lesser problem is that a repository can proxy for more than one
cluster. So when does it make sense to drop content from the repository?
It could be done when dropping from one cluster, but what of the other
one?

This complication was not necessary anyway. Instead, if it's desirable
to have some content accessed from close to the proxy, one of the
cluster nodes can just be put on the same filesystem as it. That will be
just as fast as storing the content on the proxy.
Joey Hess 2024-06-25 13:35:12 -04:00
parent 1bfe7f8a53
commit 5ede109ae5
GPG key ID: DB12DB0FF05F8F38
2 changed files with 18 additions and 20 deletions

@@ -6,23 +6,35 @@ logical repository.
## using a cluster
For example, a remote "bigserver" that is configured as a cluster will
make available an additional remote "bigserver-mycluster", as well as some
remotes for each node eg "bigserver-node1", "bigserver-node2", etc.
To use a cluster, your repository needs to have a remote that serves the
cluster. Clusters can currently only be accessed via ssh. This remote
is added the same as any other remote:
The user can get files from the cluster without caring which node it comes
git remote add bigserver me@bigserver:annex
The remote publishes information about the cluster that it serves
to the git-annex branch. (See below for how that is configured.) So you may
need to fetch from it to learn about the cluster that it serves:
git fetch bigserver
That will make available an additional remote for the cluster, eg
"bigserver-mycluster", as well as some remotes for each node eg
"bigserver-node1", "bigserver-node2", etc.
You can get files from the cluster without caring which node it comes
from:
$ git-annex get foo --from bigserver-mycluster
copy foo (from bigserver-mycluster...) ok
And the user can send files to the cluster, without caring what nodes
And you can send files to the cluster, without caring what nodes
they are stored to:
$ git-annex move bar --to bigserver-mycluster
move bar (to bigserver-mycluster...) ok
In fact, a single upload can be sent to every node of the cluster at once.
In fact, a single upload can be sent to every node of the cluster at once.
$ git-annex whereis bar
whereis bar (3 copies)
@@ -38,8 +50,6 @@ so the 3 copies are the copies on individual nodes.
Most other git-annex commands that operate on repositories can also operate on
clusters.
Clusters can only be accessed via ssh.
## configuring a cluster
A new cluster first needs to be initialized. Run [[git-annex-initcluster]] in
@@ -90,9 +100,3 @@ For example:
By default, when a file is uploaded to a cluster, it is stored on every node of
the cluster. To control which nodes to store to, the [[preferred_content]] of
each node can be configured.
If the preferred content configuration of nodes make none of them
want a copy of a file, the upload to the cluster will fail. That is done to
avoid git-annex picking an arbitrary node. But, the user can bypass the
cluster and send content to any individual node, even if it's not preferred
content of that node.

@@ -33,12 +33,6 @@ For June's work on [[design/passthrough_proxy]], remaining todos:
* Support annex.jobs for clusters.
* On upload to a cluster, as well as fanout to nodes, if the key is
preferred content of the proxy repository, store it there.
(But not when preferred content is not configured.)
And on download from a cluster, if the proxy repository has the content,
get it from there to avoid the overhead of proxying to a node.
* Basic proxying to special remote support (non-streaming).
* Support proxies-of-proxies better, eg foo-bar-baz.