gave up on upload fanout to cluster's proxy

The problem with that idea is that the cluster's proxy is necessarily a remote, and necessarily one that we'll want to sync with, since the git repository is stored there. So when its preferred content wants a file, and the cluster does too, the file will get uploaded to it as well as to the cluster. With fanout, the upload to the cluster will populate the proxy as well, avoiding a second upload. But only if the file is sent to the cluster first. If it's sent to the proxy first, there will be two uploads.

Another, lesser problem is that a repository can proxy for more than one cluster. So when does it make sense to drop content from the repository? It could be done when dropping from one cluster, but what of the other one?

This complication was not necessary anyway. Instead, if it's desirable to have some content accessed from close to the proxy, one of the cluster nodes can just be put on the same filesystem as it. That will be just as fast as storing the content on the proxy.
2024-06-25 13:35:12 -04:00
commit 5ede109ae5
parent 1bfe7f8a53
2 changed files with 18 additions and 20 deletions
--- a/doc/clusters.mdwn
+++ b/doc/clusters.mdwn
@@ -6,23 +6,35 @@ logical repository.

 ## using a cluster

-For example, a remote "bigserver" that is configured as a cluster will
-make available an additional remote "bigserver-mycluster", as well as some
-remotes for each node eg "bigserver-node1", "bigserver-node2", etc.
+To use a cluster, your repository needs to have a remote that serves the
+cluster. Clusters can currently only be accessed via ssh. This remote
+is added the same as any other remote:

-The user can get files from the cluster without caring which node it comes
+    git remote add bigserver me@bigserver:annex
+
+The remote publishes information about the cluster that it serves
+to the git-annex branch. (See below for how that is configured.) So you may
+need to fetch from it to learn about the cluster that it serves:
+
+    git fetch bigserver
+
+That will make available an additional remote for the cluster, eg
+"bigserver-mycluster", as well as some remotes for each node eg
+"bigserver-node1", "bigserver-node2", etc.
+
+You can get files from the cluster without caring which node it comes
 from:

    $ git-annex get foo --from bigserver-mycluster
    copy foo (from bigserver-mycluster...) ok

-And the user can send files to the cluster, without caring what nodes
+And you can send files to the cluster, without caring what nodes
 they are stored to:

    $ git-annex move bar --to bigserver-mycluster
    move bar (to bigserver-mycluster...) ok

-In fact, a single upload can be sent to every node of the cluster at once.
+In fact, a single upload can be sent to every node of the cluster at once. 
    
    $ git-annex whereis bar
 	whereis bar (3 copies)
@@ -38,8 +50,6 @@ so the 3 copies are the copies on individual nodes.
 Most other git-annex commands that operate on repositories can also operate on
 clusters.

-Clusters can only be accessed via ssh.
-
 ## configuring a cluster

 A new cluster first needs to be initialized. Run [[git-annex-initcluster]] in
@@ -90,9 +100,3 @@ For example:
 By default, when a file is uploaded to a cluster, it is stored on every node of
 the cluster. To control which nodes to store to, the [[preferred_content]] of
 each node can be configured.
-
-If the preferred content configuration of nodes make none of them
-want a copy of a file, the upload to the cluster will fail. That is done to
-avoid git-annex picking an arbitrary node. But, the user can bypass the
-cluster and send content to any individual node, even if it's not preferred
-content of that node.
--- a/doc/todo/git-annex_proxies.mdwn
+++ b/doc/todo/git-annex_proxies.mdwn
@@ -33,12 +33,6 @@ For June's work on [[design/passthrough_proxy]], remaining todos:

 * Support annex.jobs for clusters.

-* On upload to a cluster, as well as fanout to nodes, if the key is
-  preferred content of the proxy repository, store it there.
-  (But not when preferred content is not configured.)
-  And on download from a cluster, if the proxy repository has the content,
-  get it from there to avoid the overhead of proxying to a node.
-
 * Basic proxying to special remote support (non-streaming).

 * Support proxies-of-proxies better, eg foo-bar-baz.