From 9cdbcedc37e6c252e5a406e2930c6139e44f3bc6 Mon Sep 17 00:00:00 2001
From: Joey Hess <joeyh@joeyh.name>
Date: Wed, 1 May 2024 11:08:10 -0400
Subject: [PATCH] additional design work on proxies

Sponsored-by: Dartmouth College's OpenNeuro project
---
 doc/design/passthrough_proxy.mdwn | 57 ++++++++++++++++++++++++++++---
 1 file changed, 53 insertions(+), 4 deletions(-)

diff --git a/doc/design/passthrough_proxy.mdwn b/doc/design/passthrough_proxy.mdwn
index 91830f41ce..70ad5cedf0 100644
--- a/doc/design/passthrough_proxy.mdwn
+++ b/doc/design/passthrough_proxy.mdwn
@@ -47,8 +47,8 @@ the cluster.
 Could the P2P protocol be extended to let the proxy communicate the UUIDs
 of all the repositories behind it?
 
-Once the client git-annex knows the set of UUIDs behind the proxy, it can
-instantiate a remote object per uuid, each of which accesses the proxy, but
+Once the client git-annex knows the set of UUIDs behind the proxy, it could
+eg instantiate a remote object per UUID, each of which accesses the proxy, but
 with a different UUID.
 
 But, git-annx usually only does UUID discovery the first time a ssh remote
@@ -64,8 +64,7 @@ git-annex branch?
 
 With this approach, git-annex would know as soon as it sees the proxy's
 UUID that this is a proxy for this other set of UUIDS. (Unless its
-git-annex branch is not up-to-date.) And then it can instantiate a UUID for
-each remote.
+git-annex branch is not up-to-date.)
 
 One difficulty with this is that, when the git-annex branch is not up to
 date with changes from the proxy, git-annex may try to access repositories
@@ -76,6 +75,56 @@ to store data when eg, all the repositories that is knows about are full.
 Just getting the git-annex back in sync should recover from either
 situation.
 
+## user interface
+
+What to name the instantiated remotes? Probably the best that could
+be done is to use the proxy's own remote names as suffixes on the client.
+Eg, the proxy's "node1" remote becomes "proxy-node1" on the client.
+
+But the user probably doesn't want to pick which node to send content to.
+They don't necessarily know anything about the nodes. Ideally the user
+would `git-annex copy --to proxy` or `git-annex push` and let it pick
+which instantiated remote(s) to send to.
+
+To make `git-annex copy --to proxy` work, `storeKey` could be changed to
+allow returning a UUID (or UUIDs) where the content was actually stored.
+That would also allow a single upload to the proxy to fan out and be stored
+in multiple nodes. The proxy would use preferred content to pick which of
+its nodes to store on.
+
+Instantiated remotes would still be needed for `git-annex get` and similar
+to work.
+
+To make `git-annex copy --from proxy` work, the proxy would need to pick
+a node and stream content from it. That's doable, but how to handle a case
+where a node gets corrupted? The best it could do is mark that node as no
+longer containing the content (as if a fsck failed) and try another one
+next time. This complication might not be necessary. Consider that
+while `git-annex copy --to foo` followed later by `git-annex copy --from foo`
+will usually work, it doesn't work when eg first copying to a transfer
+remote, which then sends the content elsewhere and drops its copy.
+
+What about dropping? `git-annex drop --from proxy` could be made to work,
+by having `removeKey` return a list of UUIDs that the content was dropped
+from. What should that do if it's able to drop from some nodes but not
+others? Perhaps it would need to be able to return a list of UUIDs that
+content was dropped from but still indicate it overall failed to drop.
+(Note that it's entirely possible that dropping from one node of the proxy
+involves lockContent on another node of the proxy in order to satisfy
+numcopies.)
+
+A command like `git-annex push` would see all the instantiated remotes and
+would pick one to send content to. Seems like the proxy might choose to
+`storeKey` the content on other node(s) than the requested one. Which would
+be fine. But, `git-annex push` would still do considerable extra work in
+iterating over all the instantiated remotes. So it might be better to make
+such commands not operate on instantiated remotes for sending content but
+only on the proxy.
+
+Commands like `git-annex push` and `git-annex pull`
+should also skip the instantiated remotes when pushing or pulling the git
+repo, because that would be extra work that accomplishes nothing.
+
 ## streaming to special remotes
 
 As well as being an intermediary to git-annex repositories, the proxy could