copied over some changes from proxy branch

Joey Hess 2024-06-13 06:43:59 -04:00
parent 345494e3b4
commit 22a329c57e
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
2 changed files with 198 additions and 74 deletions


@@ -15,7 +15,7 @@ existing remotes to keep up with changes that are made on the server side.
A proxy would avoid this complexity. It also allows limiting network
ingress to a single point.
A proxy can be the frontend to a cluster. All the files
stored anywhere in the cluster would be available to retrieve from the
proxy. When a file is sent to the proxy, it would store it somewhere in the
cluster.
@@ -108,55 +108,169 @@ The only real difference seems to be that the UUID of a remote is cached,
so A could only do this the first time we accessed it, and not later.
With UUID discovery, A can do that at any time.
## proxied remote names
What to name the instantiated remotes? Probably the best that could
be done is to use the proxy's own remote names as suffixes on the client.
Eg, the proxy's "node1" remote is "proxy-node1".
But, the user might have their own "proxy-node1" remote configured that
points to something else. To avoid a proxy changing the configuration of
the user's remote to point to its remote, git-annex must avoid
instantiating a proxied remote when there's already a configuration for a
remote with that same name.
A user can also set up a remote with another name that they
prefer, that points at a remote behind a proxy. They just need to set
its annex-uuid and its url. Perhaps there should be a git-annex command
that eases setting up a remote like that?
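As a sketch, such a manually configured remote might look like this in `.git/config` (the remote name, host, and path here are hypothetical placeholders):

	[remote "mynode"]
		url = ssh://proxyhost/~/repo
		annex-uuid = <uuid of the node behind the proxy>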
## proxied remotes in git remote list
Should instantiated remotes have enough configured in git so that
`git remote list` will list them? This would make things like tab
completion of proxied remotes work, and would generally let the user
discover that there *are* proxied remotes.
This could be done by a config like remote.name.annex-proxied = true.
That makes other configs of the remote not prevent it being used as an
instantiated remote. So remote.name.annex-uuid can be changed when
the uuid behind a proxy changes. And it allows updating remote.name.url
to keep it the same as the proxy remote's url. (Or possibly to set it to
something else?)
Configuring the instantiated remotes like that would let anyone who can
write to the git-annex branch flood other people's repos with configs
for any number of git remotes. Which might be obnoxious.
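Under that proposal, an instantiated remote's configuration might look something like this (names and url hypothetical):

	[remote "proxy-node1"]
		url = ssh://proxyhost/~/repo
		annex-proxied = true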
## single upload with fanout
If we want to send a file to multiple repositories that are behind the same
proxy, it would be wasteful to upload it through the proxy repeatedly.
Perhaps a good user interface to this is `git-annex copy --to proxy`.
The proxy could fan out the upload and store it in one or more nodes behind
it. Using preferred content to select which nodes to use.
This would need `storeKey` to be changed to allow returning a UUID (or UUIDs)
where the content was actually stored.
Alternatively, `git-annex copy --to proxy-foo` could notice that proxy-bar
also wants the content, and fan out a copy to there. Then it could
record in its git-annex branch that the content is present in proxy-bar.
If the user later does `git-annex copy --to proxy-bar`, it would avoid
another upload (and the user would learn at that point that it was in
proxy-bar). This avoids needing to change the `storeKey` interface.
Should a proxy always fanout? If `git-annex copy --to proxy` is what does
fanout, and `git-annex copy --to proxy-foo` doesn't, then the user has
control. But if the latter does fanout, that might be annoying to users who
want to use proxies, but want full control over what lands where, and don't
want to use preferred content to do it. So probably fanout should be
configurable. But it can't be configured client side, because the fanout
happens on the proxy. Seems like remote.name.annex-fanout could be set to
false to prevent fanout to a specific remote. (This is analogous to a
remote having `git-annex assistant` running on it, it might fan out uploads
to it to other repos, and only the owner of that repo can control it.)
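With that proposed setting, opting a node out of fanout would look something like (the `annex-fanout` name is hypothetical until implemented):

	[remote "node1"]
		annex-fanout = false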
A command like `git-annex push` would see all the instantiated remotes and
would pick ones to send content to. If the proxy does fanout, this would
lead to `git-annex push` doing extra work iterating over instantiated
remotes that have already received content via fanout. Could this extra
work be avoided?
Commands like `git-annex push` and `git-annex pull`
should also skip the instantiated remotes when pushing or pulling the git
repo, because that would be extra work that accomplishes nothing.
## clusters
One way to use a proxy is just as a convenient way to access a group of
remotes that are behind it. Some remotes may only be reachable by the
proxy, but you still know what the individual remotes are. Eg, one might be
a S3 bucket that can only be written via the proxy, but is globally
readable without going through the proxy. Another might be a drive that is
sometimes located behind the proxy, but other times connected directly.
Using a proxy this way just involves using the instantiated proxied remotes.
Or a proxy can be the frontend for a cluster. In this situation, the user
doesn't know anything much about the nodes in the cluster, perhaps not even
that they exist, or perhaps what keys are stored on which nodes.
In the cluster case, the user would like to not need to pick a specific
node to send content to. While they could use preferred content to pick a
node, or nodes, they would prefer to be able to say `git-annex copy --to cluster`
and let it pick which nodes to send to. And similarly,
`git-annex drop --from cluster` should drop the content from every node in
the cluster.
For this we need a UUID for the cluster. But it is not like a usual UUID.
It does not need to actually be recorded in the location tracking logs, and
it is not counted as a copy for numcopies purposes. The only point of this
UUID is to make commands like `git-annex drop --from cluster` and
`git-annex get --from cluster` talk to the cluster's frontend proxy, which
has as its UUID the cluster's UUID.
The cluster UUID is recorded in the git-annex branch, along with a list of
the UUIDs of nodes of the cluster (which can change at any time).
When reading a location log, if any UUID where content is present is part
of the cluster, the cluster's UUID is added to the list of UUIDs.
When writing a location log, the cluster's UUID is filtered out of the list
of UUIDs.
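The read/write rule above can be sketched in Python (a simulation only, not git-annex's actual Haskell log code; the function names are hypothetical):

```python
def read_location_log(present_uuids, clusters):
    """clusters maps each cluster UUID to the set of its node UUIDs.
    When any node of a cluster has the content, the cluster UUID is
    reported as also having it."""
    out = set(present_uuids)
    for cluster_uuid, nodes in clusters.items():
        if out & nodes:
            out.add(cluster_uuid)
    return out

def write_location_log(present_uuids, clusters):
    """Cluster UUIDs are never written to the log on disk."""
    return {u for u in present_uuids if u not in clusters}
```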
The cluster's frontend proxy fans out uploads to nodes according to
preferred content. And `storeKey` is extended to be able to return a list
of additional UUIDs where the content was stored. So an upload to the
cluster will end up writing to the location log the actual nodes that it
was fanned out to.
Note that to support clusters that are nodes of clusters, when a cluster's
frontend proxy fans out an upload to a node, and `storeKey` returns
additional UUIDs, it should pass those UUIDs along. Of course, no cluster
can be a node of itself, and cycles have to be broken (as described in a
section below).
When a file is requested from the cluster's frontend proxy, it can send its
own local copy if it has one, but otherwise it will proxy to one of its
nodes. (How to pick which node to use? Load balancing?) This behavior will
need to be added to git-annex-shell, and to Remote.Git for local paths to a
cluster.
The cluster's frontend proxy also fans out drops to all nodes, attempting
to drop content from the whole cluster, and only indicating success if it
can. Also needs changes to git-annex-shell and Remote.Git.
It does not fan out lockcontent; instead the client will lock content
on specific nodes. In fact, the cluster UUID should probably be omitted
when constructing a drop proof, since trying to lockcontent on it will
usually fail.
Some commands like `git-annex whereis` will list content as being stored in
the cluster, as well as on whichever of its nodes. Whereis currently says
"n copies", but since the cluster doesn't count as a copy, that count
should probably use the numcopies logic that excludes cluster UUIDs.
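A sketch of that counting rule (a hypothetical helper, not git-annex code):

```python
def counted_copies(present_uuids, cluster_uuids):
    # A cluster UUID stands for its frontend proxy, not a real repository,
    # so it is excluded from the numcopies count.
    return len(set(present_uuids) - set(cluster_uuids))
```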
No other protocol extensions or special cases should be needed. Except for
the strange case of content stored in the cluster's frontend proxy.
Running `git-annex fsck --fast` on the cluster's frontend proxy will look
weird: For each file, it will read the location log, and if the file is
present on any node it will add the frontend proxy's UUID. So fsck will
expect the content to be present. But it probably won't be. So it will fix
the location log... which will make no changes since the proxy's UUID will
be filtered out on write. So probably fsck will need a special case to
avoid this behavior. (Also for `git-annex fsck --from cluster --fast`)
And if a key does get stored on the cluster's frontend proxy, it will not
be possible to tell from looking at the location log that the content is
really present there. So that won't be counted as a copy. In some cases,
a cluster's frontend proxy may want to keep files, perhaps some files are
worth caching there for speed. But if a file is stored only on the
cluster's frontend proxy and not in any of its nodes, clients will not
consider the cluster to contain the file at all.
## speed
@@ -246,6 +360,23 @@ in front of the proxy.
## cycles
A repo can advertise that it proxies for a repo which has the same uuid as
itself. Or there can be a larger cycle involving a proxy that proxies to a
proxy, etc.
Since the proxied repo uuid is communicated to git-annex-shell via
--uuid, a repo that advertises proxying for itself will be connected to
with its own uuid. No proxying is done in this case. Same happens with a
larger cycle.
Instantiating remotes needs to identify cycles and break them. Otherwise
it would construct an infinite number of proxied remotes with names
like "foo-foo-foo-foo-..." or "foo-bar-foo-bar-..."
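One way to break such cycles is to track the UUIDs already seen along the chain of proxies and stop when one repeats. A Python sketch (the data layout and function are hypothetical, not git-annex's implementation):

```python
def instantiated_names(name, uuid, proxies, seen=frozenset()):
    """proxies maps a proxy's UUID to the (remote name, remote UUID) pairs
    it advertises. Yields instantiated remote names, skipping any remote
    whose UUID already appeared along this chain, so cycles terminate."""
    seen = seen | {uuid}
    for child_name, child_uuid in proxies.get(uuid, ()):
        if child_uuid in seen:
            continue  # would become "foo-bar-foo-...": break the cycle here
        full = name + "-" + child_name
        yield full
        yield from instantiated_names(full, child_uuid, proxies, seen)
```

With two proxies that each advertise the other, this yields only a finite "foo-bar" from foo's side rather than an infinite "foo-bar-foo-…" series.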
Once `git-annex copy --to proxy` is implemented, and the proxy decides
where to send content that is being sent directly to it, cycles will
become an issue with that as well.
What if repo A is a proxy and has repo B as a remote, while repo B is a
proxy and has repo A as a remote?
@@ -259,7 +390,7 @@ remote that is not part of a cycle, they could deposit the upload there and
the upload still succeed. Otherwise the upload would fail, which is
probably the best that can be done with such a broken configuration.
So, it seems like proxies would need to take transfer locks for uploads,
even though the content is being proxied to elsewhere.
Dropping could have similar cycles with content presence locking, which


@@ -26,52 +26,45 @@ In development on the `proxy` branch.
For June's work on [[design/passthrough_proxy]], implementation plan:
* UUID discovery via git-annex branch. Add a log file listing UUIDs
  accessible via proxy UUIDs. It also will contain the names
  of the remotes that the proxy is a proxy for,
  from the perspective of the proxy. (done)
* Add `git-annex updateproxy` command and remote.name.annex-proxy
  configuration. (done)
* Remote instantiation for proxies. (done)
* Implement git-annex-shell proxying to git remotes. (done)
* Proxy should update location tracking information for proxied remotes,
  so it is available to other users who sync with it. (done)
* Consider getting instantiated remotes into git remote list.
  See design.
* Implement single upload with fanout to proxied remotes.
* Implement clusters.
* Support proxies-of-proxies better, eg foo-bar-baz.
  Currently, it does work, but have to run `git-annex updateproxy`
  on foo in order for it to notice the bar-baz proxied remote exists,
  and record it as foo-bar-baz. Make it skip recording proxies of
  proxies like that, and instead automatically generate those from the log.
  (With cycle prevention there of course.)
* Cycle prevention. See design.
* Optimise proxy speed. See design for ideas.
* Use `sendfile()` to avoid data copying overhead when
  `receiveBytes` is being fed right into `sendBytes`.
* Encryption and chunking. See design for issues.
* Indirect uploads (to be considered). See design.
* Support using a proxy when its url is a P2P address.
  (Eg tor-annex remotes.)