designing clusters

Commit a986a20034 (parent e70e3473b3): 2 changed files with 75 additions and 56 deletions

@@ -15,7 +15,7 @@ existing remotes to keep up with changes are made on the server side.

A proxy would avoid this complexity. It also allows limiting network
ingress to a single point.

A proxy can be the frontend to a cluster. All the files
stored anywhere in the cluster would be available to retrieve from the
proxy. When a file is sent to the proxy, it would store it somewhere in the
cluster.

@@ -148,51 +148,76 @@ Configuring the instantiated remotes like that would let anyone who can
write to the git-annex branch flood other people's repos with configs
for any number of git remotes. Which might be obnoxious.

## single upload with fanout

If we want to send a file to multiple repositories that are behind the same
proxy, it would be wasteful to upload it through the proxy repeatedly.

Perhaps a good user interface to this is `git-annex copy --to proxy`.
The proxy could fan out the upload and store it in one or more nodes behind
it, using preferred content to select which nodes to use.
This would need `storeKey` to be changed to allow returning a UUID (or UUIDs)
where the content was actually stored.

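This is only a rough illustration of that interface change, not git-annex's actual `storeKey`; the `Node`, `UUID`, `Key`, and `storeKeyFanout` names and types below are hypothetical simplifications.

```haskell
import Control.Monad (filterM)

newtype UUID = UUID String deriving (Eq, Show)
newtype Key  = Key String  deriving (Eq, Show)

-- A node behind the proxy: its UUID, a stand-in for its preferred
-- content check, and an action that stores a key on it.
data Node = Node
    { nodeUUID    :: UUID
    , wantsKey    :: Key -> IO Bool  -- preferred content wants this key
    , storeOnNode :: Key -> IO Bool  -- True if the store succeeded
    }

-- Fan an upload out to every node whose preferred content wants the
-- key, and return the UUIDs where the content was actually stored,
-- which is what a changed storeKey would need to report back.
storeKeyFanout :: [Node] -> Key -> IO [UUID]
storeKeyFanout nodes k = do
    wanted <- filterM (`wantsKey` k) nodes
    stored <- filterM (`storeOnNode` k) wanted
    return (map nodeUUID stored)
```
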
Alternatively, `git-annex copy --to proxy-foo` could notice that proxy-bar
also wants the content, and fan out a copy to there. Then it could
record in its git-annex branch that the content is present in proxy-bar.
If the user later does `git-annex copy --to proxy-bar`, it would avoid
another upload (and the user would learn at that point that it was in
proxy-bar). This avoids needing to change the `storeKey` interface.

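A small sketch of that alternative, under the same caveat that `LocationLog` and `copyToUnlessPresent` are made-up stand-ins for consulting the git-annex branch: the upload to proxy-bar is skipped when the earlier fanout already recorded the content there.

```haskell
newtype UUID = UUID String deriving (Eq, Show)
newtype Key  = Key String  deriving (Eq, Show)

-- Stand-in for querying the location log in the git-annex branch.
type LocationLog = Key -> [UUID]

-- Skip the upload when the location log already lists the target
-- remote, eg because an earlier `git-annex copy --to proxy-foo`
-- fanned the content out to proxy-bar and recorded that.
copyToUnlessPresent :: LocationLog -> UUID -> (Key -> IO ()) -> Key -> IO ()
copyToUnlessPresent locations target upload k
    | target `elem` locations k = return ()  -- already present, nothing to do
    | otherwise                 = upload k
```
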
A command like `git-annex push` would see all the instantiated remotes and
would pick ones to send content to. If the proxy does fanout, this would
lead to `git-annex push` doing extra work iterating over instantiated
remotes that have already received content via fanout. Could this extra
work be avoided?

## clusters

One way to use a proxy is just as a convenient way to access a group of
remotes that are behind it. Some remotes may only be reachable by the
proxy, but you still know what the individual remotes are. Eg, one might be
an S3 bucket that can only be written via the proxy, but is globally
readable without going through the proxy. Another might be a drive that is
sometimes located behind the proxy, but at other times connected directly.
Using a proxy this way just involves using the instantiated proxied remotes.

Or a proxy can be the frontend for a cluster. In this situation, the user
doesn't know much about the nodes in the cluster, perhaps not even
that they exist, or which keys are stored on which nodes.

In the cluster case, the user would like to not need to pick a specific
node to send content to. While they could use preferred content to pick a
node, or nodes, they would prefer to be able to say `git-annex copy --to cluster`
and let it pick which proxied remote(s) to send to. And similarly,
`git-annex drop --from cluster` should drop the content from every node in
the cluster.

Let's suppose there is a config to say that a repository is a proxy for
a cluster. The cluster gets its own UUID. This is not the same as the UUID
of the proxy repository.

When a cluster is a remote, its annex-uuid is the cluster UUID.
No proxied remotes are instantiated for a cluster.

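To make the separation of UUIDs concrete, here is a hypothetical data model (not actual git-annex code): the cluster's UUID is what a remote pointing at the cluster exposes as its annex-uuid, and it is distinct from the proxy repository's own UUID.

```haskell
newtype UUID = UUID String deriving (Eq, Show)

-- Hypothetical model of a proxy that fronts a cluster.
data Cluster = Cluster
    { clusterUUID :: UUID   -- the cluster's own UUID
    , proxyUUID   :: UUID   -- the proxy repository's UUID, a different value
    , nodeUUIDs   :: [UUID] -- nodes behind the proxy; no remotes instantiated
    } deriving (Eq, Show)

-- When the cluster is used as a remote, its annex-uuid is the
-- cluster UUID, not the proxy repository's UUID.
remoteAnnexUUID :: Cluster -> UUID
remoteAnnexUUID = clusterUUID
```
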
Copying to a cluster would cause the transfer to be proxied to one or more
nodes. The location log would be updated to say the key is present in the
cluster UUID. The cluster proxy would also record the UUIDs of the nodes
where the content was stored, since it does need to remember that.
But it would not need to expose that; the nodes might have annex-private set.

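A sketch of what the proxy would record after such an upload, with hypothetical types: the public location log entry uses the cluster UUID, while the node UUIDs are the proxy's own bookkeeping and need not be exposed.

```haskell
newtype UUID = UUID String deriving (Eq, Show)
newtype Key  = Key String  deriving (Eq, Show)

-- What gets recorded after the proxy fans an upload out to some nodes.
data ClusterUpload = ClusterUpload
    { loggedLocation :: (Key, UUID)    -- public: key is present in the cluster UUID
    , storedOnNodes  :: (Key, [UUID])  -- private bookkeeping: which nodes hold it
    } deriving (Eq, Show)

recordClusterUpload :: UUID -> Key -> [UUID] -> ClusterUpload
recordClusterUpload clusteruuid k nodes = ClusterUpload
    { loggedLocation = (k, clusteruuid)
    , storedOnNodes  = (k, nodes)
    }
```
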
Getting from a cluster would pick a node that has the content and
proxy a transfer from that node.

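A minimal sketch of that selection, again with made-up `Node` fields: pick any node the proxy knows to have the content and stream from it.

```haskell
import Data.List (find)

newtype UUID = UUID String deriving (Eq, Show)
newtype Key  = Key String  deriving (Eq, Show)

data Node = Node
    { nodeUUID   :: UUID
    , hasContent :: Key -> Bool         -- per the proxy's own records
    , retrieve   :: Key -> IO FilePath  -- proxied transfer from this node
    }

-- Pick a node that has the content and proxy a transfer from it;
-- Nothing when no node has it.
getFromCluster :: [Node] -> Key -> Maybe (IO FilePath)
getFromCluster nodes k = (`retrieve` k) <$> find (`hasContent` k) nodes
```
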
Dropping from a cluster would drop from every node that has the
content. Once the content is entirely gone from the cluster, it would
record it as not present in the cluster's UUID. (If some drops failed, the
overall drop would fail.)

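A sketch of those drop semantics with the same hypothetical `Node` type: drop from every node that has the content, and treat the drop as an overall success only when every per-node drop succeeded.

```haskell
newtype UUID = UUID String deriving (Eq, Show)
newtype Key  = Key String  deriving (Eq, Show)

data Node = Node
    { nodeUUID :: UUID
    , hasKey   :: Key -> Bool
    , dropKey  :: Key -> IO Bool  -- True if the node dropped the content
    }

-- Drop from every node that has the content. Only if every drop
-- succeeded is the content entirely gone from the cluster, and only
-- then would the cluster UUID be marked as no longer containing it.
dropFromCluster :: [Node] -> Key -> IO (Bool, [UUID])
dropFromCluster nodes k = do
    results <- mapM dropOne [ n | n <- nodes, hasKey n k ]
    let droppedFrom = [ u | (u, True) <- results ]
    return (all snd results, droppedFrom)
  where
    dropOne n = do
        ok <- dropKey n k
        return (nodeUUID n, ok)
```
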
Checkpresent to a cluster would proxy a checkpresent to nodes until it
found one that does have the content.

Lockcontent to a cluster would lock the content on one (or more?) nodes.

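The same style of sketch for these two operations, with hypothetical per-node `checkPresent` and `lockContent` actions: checkpresent stops at the first node that has the content, and lockcontent locks it on the first node that will.

```haskell
newtype UUID = UUID String deriving (Eq, Show)
newtype Key  = Key String  deriving (Eq, Show)

data Node = Node
    { nodeUUID     :: UUID
    , checkPresent :: Key -> IO Bool  -- proxied checkpresent to this node
    , lockContent  :: Key -> IO Bool  -- True if the node locked the content
    }

-- Ask nodes one at a time, stopping as soon as one has the content.
checkPresentCluster :: [Node] -> Key -> IO Bool
checkPresentCluster [] _ = return False
checkPresentCluster (n:ns) k = do
    present <- checkPresent n k
    if present then return True else checkPresentCluster ns k

-- Lock the content on the first node willing to lock it (one node is
-- enough to hold a copy in place; locking on more would be stricter).
lockContentCluster :: [Node] -> Key -> IO (Maybe UUID)
lockContentCluster [] _ = return Nothing
lockContentCluster (n:ns) k = do
    locked <- lockContent n k
    if locked then return (Just (nodeUUID n)) else lockContentCluster ns k
```
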
## speed

@@ -44,6 +44,15 @@ For June's work on [[design/passthrough_proxy]], implementation plan:
* Consider getting instantiated remotes into git remote list.
  See design.

* Make commands like `git-annex sync` not git push/pull to proxied remotes.
  That doesn't work because they have no url. Or, if proxied remotes are in
  git remote list, it is unnecessary work because it's the same url as the
  proxy.

* Implement single upload with fanout to proxied remotes.

* Implement clusters.

* Support proxies-of-proxies better, eg foo-bar-baz.
  Currently, it does work, but you have to run `git-annex updateproxy`
  on foo in order for it to notice the bar-baz proxied remote exists,

@@ -53,21 +62,6 @@ For June's work on [[design/passthrough_proxy]], implementation plan:

* Cycle prevention. See design.

* Optimise proxy speed. See design for ideas.

* Use `sendfile()` to avoid data copying overhead when

@@ -75,7 +69,7 @@ For June's work on [[design/passthrough_proxy]], implementation plan:

* Encryption and chunking. See design for issues.

* Indirect uploads (to be considered). See design.

* Support using a proxy when its url is a P2P address.
  (Eg tor-annex remotes.)