git-annex/doc/design/passthrough_proxy.mdwn

490 lines
23 KiB
Markdown

[[!toc ]]
## motivations
When [[balanced_preferred_content]] is used, there may be many repositories
in a location -- either a server or a cluster -- and getting any given file
may need to access any of them. Configuring remotes for each repository
adds a lot of complexity, both in setting up access controls on each
server, and for the user.
Particularly on the user side, when ssh is used they may have to deal with
many different ssh host keys, as well as adding new remotes or removing
existing remotes to keep up with changes are made on the server side.
A proxy would avoid this complexity. It also allows limiting network
ingress to a single point.
A proxy can be the frontend to a cluster. All the files
stored anywhere in the cluster would be available to retrieve from the
proxy. When a file is sent to the proxy, it would store it somewhere in the
cluster.
Currently the closest git-annex can get to implementing such a proxy is a
transfer repository that wants all content that is not yet stored in the
cluster. This allows incoming transfers to be accepted and distributed to
nodes of the cluster. To get data back out of the cluster, there has to be
some communication that it is preferred content (eg, setting metadata),
then after some delay for it to be copied back to the transfer repository,
it becomes available for the client to download it. And once it knows the
client has its copy, it can be removed from the transfer repository.
That is quite slow, and rather clumsy. And it risks the transfer repository
filling up with data that has been requested by clients that have not yet
picked it up, or with incoming transfers that have not yet reached the
cluster.
A proxy would not hold the content of files itself. It would be a clone of
the git repository though, probably. Uploads and downloads would stream
through the proxy.
## protocol
The git-annex [[P2P_protocol]] would be relayed via the proxy,
which would be a regular git ssh remote.
There is also the possibility of relaying the P2P protocol over another
protocol such as HTTP, see [[P2P_protocol_over_http]].
## UUID discovery
A significant difficulty in implementing a proxy is that each git-annex
remote has a single UUID. But the remote that points at the proxy can't
just have the UUID of the proxy's repository, git-annex needs to know that
the proxy's remote can be used to access repositories with every UUID in
the cluster.
### UUID discovery via P2P protocol extension
Could the P2P protocol be extended to let the proxy communicate the UUIDs
of all the repositories behind it?
Once the client git-annex knows the set of UUIDs behind the proxy, it could
eg instantiate a remote object per UUID, each of which accesses the proxy, but
with a different UUID.
But, git-annex usually only does UUID discovery the first time a ssh remote
is accessed. So it would need to discover at that point that the remote is
a proxy. Then it could do UUID discovery each time git-annex starts up.
But that adds significant overhead, git-annex would be making a connection
to the proxy in situations where it is not going to use it.
### UUID discovery via git-annex branch
Could the proxy's set of UUIDs instead be recorded somewhere in the
git-annex branch?
With this approach, git-annex would know as soon as it sees the proxy's
UUID that this is a proxy for this other set of UUIDS. (Unless its
git-annex branch is not up-to-date.)
One difficulty with this is that, when the git-annex branch is not up to
date with changes from the proxy, git-annex may try to access repositories
that are no longer available behind the proxy. That failure would be
handled the same as any other currently unavailable repository. Also
git-annex would not use the full set of repositories, so might not be able
to store data when eg, all the repositories that is knows about are full.
Just getting the git-annex back in sync should recover from either
situation.
> This seems like the clear winner.
## UUID discovery security
Are there any security concerns with adding UUID discovery?
Suppose that repository A claims to be a proxy for repository B, but it's
not connected to B, and is actually evil. Then git-annex would instantiate
a remote A-B with the UUID of B. If files were sent to A-B, git-annex would
consider them present on B, and not send them to B by other remotes.
Well, in this situation, A wrote to the git-annex branch (or used a P2P
protocol extension) in order to pose as B. Without a proxy feature A could
just as well falsify location logs to claim that B contains things it did
not. Also, without a proxy feature, A could set its UUID to be the same as
B, and so trick us into sending files to it rather than B.
The only real difference seems to be that the UUID of a remote is cached,
so A could only do this the first time we accessed it, and not later.
With UUID discovery, A can do that at any time.
## proxied remote names
What to name the instantiated remotes? Probably the best that could
be done is to use the proxy's own remote names as suffixes on the client.
Eg, the proxy's "node1" remote is "proxy-node1".
But, the user might have their own "proxy-node1" remote configured that
points to something else. To avoid a proxy changing the configuration of
the user's remote to point to its remote, git-annex must avoid
instantiating a proxied remote when there's already a configuration for a
remote with that same name.
That does mean that, if a user wants to set a git config for a proxy
remote, they will need to manually set its annex-uuid and its url.
Which is awkward. Many git configs of the proxy remote can be inherited by
the instantiated remotes, so users won't often need to do that.
A user can also set up a remote with another name that they
prefer, that points at a remote behind a proxy. They just need to set
its annex-uuid and its url. Perhaps there should be a git-annex command
that eases setting up a remote like that?
## proxied remotes in git remote list
Should instantiated remotes have enough configured in git so that
`git remote list` will list them? This would make things like tab
completion of proxied remotes work, and would generally let the user
discover that there *are* proxied remotes.
This could be done by a config like remote.name.annex-proxied = true.
That makes other configs of the remote not prevent it being used as an
instantiated remote. So remote.name.annex-uuid can be changed when
the uuid behind a proxy changes. And it allows updating remote.name.url
to keep it the same as the proxy remote's url. (Or possibly to set it to
something else?)
Configuring the instantiated remotes like that would let anyone who can
write to the git-annex branch flood other people's repos with configs
for any number of git remotes. Which might be obnoxious.
## single upload with fanout
If we want to send a file to multiple repositories that are behind the same
proxy, it would be wasteful to upload it through the proxy repeatedly.
Perhaps a good user interface to this is `git-annex copy --to proxy`.
The proxy could fan out the upload and store it in one or more nodes behind
it. Using preferred content to select which nodes to use.
This would need `storeKey` to be changed to allow returning a UUID (or UUIDs)
where the content was actually stored.
Alternatively, `git-annex copy --to proxy-foo` could notice that proxy-bar
also wants the content, and fan out a copy to there. Then it could
record in its git-annex branch that the content is present in proxy-bar.
If the user later does `git-annex copy --to proxy-bar`, it would avoid
another upload (and the user would learn at that point that it was in
proxy-bar). This avoids needing to change the `storeKey` interface.
Should a proxy always fanout? if `git-annex copy --to proxy` is what does
fanout, and `git-annex copy --to proxy-foo` doesn't, then the user has
content. But if the latter does fanout, that might be annoying to users who
want to use proxies, but want full control over what lands where, and don't
want to use preferred content to do it. So probably fanout should be
configurable. But it can't be configured client side, because the fanout
happens on the proxy. Seems like remote.name.annex-fanout could be set to
false to prevent fanout to a specific remote. (This is analagous to a
remote having `git-annex assistant` running on it, it might fan out uploads
to it to other repos, and only the owner of that repo can control it.)
A command like `git-annex push` would see all the instantiated remotes and
would pick ones to send content to. If the proxy does fanout, this would
lead to `git-annex push` doing extra work iterating over instantiated
remotes that have already received content via fanout. Could this extra
work be avoided?
## clusters
One way to use a proxy is just as a convenient way to access a group of
remotes that are behind it. Some remotes may only be reachable by the
proxy, but you still know what the individual remotes are. Eg, one might be
a S3 bucket that can only be written via the proxy, but is globally
readable without going through the proxy. Another might be a drive that is
sometimes located behind the proxy, but other times connected directly.
Using a proxy this way just involves using the instantiated proxied remotes.
Or a proxy can be the frontend for a cluster. In this situation, the user
doesn't know anything much about the nodes in the cluster, perhaps not even
that they exist, or perhaps what keys are stored on which nodes.
In the cluster case, the user would like to not need to pick a specific
node to send content to. While they could use preferred content to pick a
node, or nodes, they would prefer to be able to say `git-annex copy --to cluster`
and let it pick which nodes to send to. And similarly,
`git-annex drop --from cluster' should drop the content from every node in
the cluster.
For this we need a UUID for the cluster. But it is not like a usual UUID.
It does not need to actually be recorded in the location tracking logs, and
it is not counted as a copy for numcopies purposes. The only point of this
UUID is to make commands like `git-annex drop --from cluster` and
`git-annex get --from cluster` talk to the cluster's frontend proxy.
The proxy log contains the cluster UUID (with a remote name like
"cluster"), as well as the UUIDs of the nodes of the cluster.
This makes the client access the cluster using the proxy. Note that more
than one proxy can be in front of the same cluster, and multiple clusters
can be accessed via the same proxy.
The cluster UUID is recorded in the git-annex branch, along with a list of
the UUIDs of nodes of the cluster (which can change at any time).
When reading a location log, if any UUID where content is present is part
of the cluster, the cluster's UUID is added to the list of UUIDs.
When writing a location log, the cluster's UUID is filtered out of the list
of UUIDs.
When proxying an upload to the cluster's UUID, git-annex-shell fans out
uploads to nodes according to preferred content. And `storeKey` is extended
to be able to return a list of additional UUIDs where the content was
stored. So an upload to the cluster will end up writing to the location log
the actual nodes that it was fanned out to.
Note that to support clusters that are nodes of clusters, when a cluster's
frontend proxy fans out an upload to a node, and `storeKey` returns
additional UUIDs, it should pass those UUIDs along. Of course, no cluster
can be a node of itself, and cycles have to be broken (as described in a
section below).
When a file is requested from the cluster's UUID, git-annex-shell picks one
of the nodes that has the content, and proxies to that one.
(How to pick which node to use? Load balancing?)
And, if the proxy repository itself contains the requested key, it can send
it directly. This allows the proxy repository to be primed with frequently
accessed files when it has the space.
When a drop is requested from the cluster's UUID, git-annex-shell drops
from all nodes, as well as from the proxy itself. Only indicating success
if it is able to delete all copies from the cluster.
It does not fan out lockcontent, instead the client will lock content
on specific nodes. In fact, the cluster UUID should probably be omitted
when constructing a drop proof, since trying to lockcontent on it will
always fail.
Some commands like `git-annex whereis` will list content as being stored in
the cluster, as well as on whichever of its nodes, and whereis currently
says "n copies", but since the cluster doesn't count as a copy, that
display should probably be counted using the numcopies logic that excludes
cluster UUIDs.
No other protocol extensions or special cases should be needed.
## speed
A passthrough proxy should be as fast as possible so as not to add overhead
to a file retrieve, store, or checkpresent. This probably means that
it keeps TCP connections open to each host in the cluster. It might use a
protocol with less overhead than ssh.
In the case of checkpresent, it would be possible for the proxy to not
communicate with the cluster to check that the data is still present on it.
As long as all access is intermediated via the proxy, its git-annex branch
could be relied on to always be correct, in theory. Proving that theory,
making sure to account for all possible race conditions and other scenarios,
would be necessary for such an optimisation.
Another way the proxy could speed things up is to cache some subset of
content. Eg, analize what files are typically requested, and store another
copy of those on the proxy. Perhaps prioritize storing smaller files, where
latency tends to swamp transfer speed.
## streaming to special remotes
As well as being an intermediary to git-annex repositories, the proxy could
provide access to other special remotes. That could be an object store like
S3, which might be internal to the cluster or not. When using a cloud
service like S3, only the proxy needs to know the access credentials.
Currently git-annex does not support streaming content to special remotes.
The remote interface operates on object files stored on disk. See
[[todo/transitive_transfers]] for discussion of that problem. If proxies
get implemented, that problem should be revisited.
## chunking
When the proxy is in front of a special remote that is chunked,
where does the chunking happen? It could happen on the client, or on the
proxy.
Git remotes don't ever do chunking currently, so chunking on the client
would need changes there.
Also, a given upload via a proxy may get sent to several special remotes,
each with different chunk sizes, or perhaps some not chunked and some
chunked. For uploads to be efficient, chunking needs to happen on the proxy.
## encryption
When the proxy is in front of a special remote that uses encryption, where
does the encryption happen? It could either happen on the client before
sending to the proxy, or the proxy could do the encryption since it
communicates with the special remote.
If the client does not want the proxy to see unencrypted data,
they would obviously prefer encryption happens locally.
But, the proxy could be the only thing that has access to a security key
that is used in encrypting a special remote that's located behind it.
There's a security benefit there too.
So there are kind of two different perspectives here that can have
different opinions.
Also if encryption for a special remote behind a proxy happened
client-side, and the client relied on that, nothing would stop the proxy
from replacing that encrypted special remote with an unencrypted remote.
Then the client side encryption would not happen, the user would not
notice, and the proxy could see their unencrypted content.
Of course, if a client really wanted to, they could make a special remote
that uses the remote behind the proxy as a key/value backend.
Then the client could encrypt locally.
On the implementation side, git-annex's git remotes don't currently ever do
encryption. And special remotes don't communicate via the P2P protocol with
a git remote. So none of git-annex's existing remote implementations would
be able to handle client-side encryption.
There's potentially a layering problem here, because exactly how encryption
works can vary depending on the type of special remote.
Encrypted and chunked special remotes first chunk, then encrypt.
So it chunking happens on the proxy, encryption *must* also happen there.
So overall, it seems better to do proxy-side encryption. But it may be
worth adding a special remote that does its own client-side encryption
in front of the proxy.
## cycles
A repo can advertise that it proxies for a repo which has the same uuid as
itself. Or there can be a larger cycle involving a proxy that proxies to a
proxy, etc.
Since the proxied repo uuid is communicated to git-annex-shell via
--uuid, a repo that advertises proxying for itself will be connected to
with its own uuid. No proxying is done in this case. Same happens with a
larger cycle.
Instantiating remotes needs to identity cycles and break them. Otherwise
it would construct an infinite number of proxied remotes with names
like "foo-foo-foo-foo-..." or "foo-bar-foo-bar-..."
Once `git-annex copy --to proxy` is implemented, and the proxy decides
where to send content that is being sent directly to it, cycles will
become an issue with that as well.
What if repo A is a proxy and has repo B as a remote. Meanwhile, repo B is
a proxy and has repo A as a remote?
An upload to repo A will start by checking if repo B wants the content and if so,
start an upload to repo B. Then the same happens on repo B, leading it to
start an upload to repo A.
At this point, it might be possible for git-annex to detect the cycle,
if the proxy uses a transfer lock file. If repo B or repo A had some other
remote that is not part of a cycle, they could deposit the upload there and
the upload still succeed. Otherwise the upload would fail, which is
probably the best that can be done with such a broken configuration.
So, it seems like proxies would need to take transfer locks for uploads,
even though the content is being proxied to elsewhere.
Dropping could have similar cycles with content presence locking, which
needs to be thought through as well. A cycle of the actual dropContent
operation might also be possible.
## exporttree=yes
Could the proxy be in front of a special remote that uses exporttree=yes?
Some possible approaches:
* Proxy caches files somewhere until all the files in the configured
annex-tracking-branch are available, then exports them all to the special
remote.
* Proxy exports each file to the special remote as it is received.
It records an incomplete tree export after each export.
Once all files in the configured annex-tracking-branch have been sent,
it records a completed tree export. This seems possible, it's similar
to `git-annex export --to=remote` recovering after having been
interrupted.
* Proxy storeExport and all related export/import actions. This would need
a large expansion of the P2P protocol.
The first two approaches need some way to communicate the
configured annex-tracking-branch over the P2P protocol. Or to communicate
the tree that it currently points to.
A proxy for a git repo does not proxy access to the git repo itself, so
`git push origin-foo master` actually pushes the ref to the proxy's own git
repo. Perhaps this points in a direction of how the proxy could learn what
tree to export to exporttree=yes remotes. But only vaguely since how would
it pick which of multiple branches to export?
Perhaps configure the annex-tracking-branch in the git-annex branch?
That might be generally useful when working with exporttree=yes remotes.
The first two approaches also have a complication when a key is sent to
the proxy that is not part of the configured annex-tracking-branch. What
does the proxy do with it? There seem three possibilities:
1. Reject the transfer of the key.
2. Send the key to another proxied remote that is not exporttree=yes
(and get it from there later if needed to finish populating an export)
3. Store the key locally. (Not desirable because proxy repos may be on
small disks as they don't usually need to hold any files.)
The third approach would mean the user needs to use `git-annex export --to`
in order to update proxied exporttree remotes. Which gets in the way of the
other proxy workflows and requires them to know that the proxy has an
exporttree remote behind it.
Tentative design for exporttree=yes with proxies:
* Configure annex-tracking-branch for the proxy in the git-annex branch.
(For the proxy as a whole, or for specific exporttree=yes repos behind
it?)
* Then the user's workflow is simply: `git-annex push proxy`
* sync/push need to first push any updated annex-tracking-branch to the
proxy before sending content to it. (Currently sync only pushes at the
end.)
* If proxied remotes are all exporttree=yes, the proxy rejects any
transfers of a key that is not in the annex-tracking-branch that it
currently knows about. If there is any other proxied remote, the proxy
can direct such transfers to it.
* Upon receiving a new annex-tracking-branch or any transfer of a key
used in the current annex-tracking-branch, the proxy can update
the exporttree=yes remotes. This needs to happen incrementally,
eg upon receiving a key, just proxy it on to the exporttree=yes remote,
and update the export database. Once all keys are received, update
the git-annex branch to indicate a new tree has been exported.
* Upon receiving a git push of the annex-tracking-branch, a proxy might
be able to get all the changed objects from non-exporttree=yes proxied
remotes that contain them. If so it can update the exporttree=yes
remote automatically and inexpensively. At the same time, a
`git-annex push` will be attempting to send those same objects.
So somehow the proxy will need to manage this situation.
## possible enhancement: indirect uploads
(Thanks to Chris Markiewicz for this idea.)
When a client wants to upload an object, the proxy could indicate that the
upload should not be sent to it, but instead be PUT to a HTTP url that it
provides to the client.
An example use case involves
[presigned S3 urls](https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-presigned-url.html).
When one of the proxy's nodes is a S3 bucket, having the client upload
directly to S3 would avoid needing double traffic through the proxy's
network.
This would need a special remote that generates the presigned S3 url.
Probably an external, so the external special remote protocol would need to
be updated as well as the P2P protocol.
Since an upload to a proxy can be distributed to multiple nodes, should
the proxy be able to indicate more than one url that the client
should upload to? Also the proxy might want an upload to still be sent to
it in addition to url(s). Of course the downside is that the client would
need to upload more than once, which eliminates one benefit of the proxy.
So it might be reasonable to only support one url, but what if the proxy
has multiple remotes that want to provide urls, how does it pick which one
wins?
Is only an URL enough for the client to be able to upload to wherever? It
may be that the HTTP verb is also necessary. Consider POST vs PUT. Some
services might need additional HTTP headers.