doc/design/passthrouh_proxy.mdwn

When [[balanced_preferred_content]] is used, there may be many repositories
in a location -- either a server or a cluster -- and getting any given file
may need to access any of them. Configuring remotes for each repository
adds a lot of complexity, both in setting up access controls on each
server, and for the user.

Particularly on the user side, when ssh is used they may have to deal with
many different ssh host keys, as well as adding new remotes or removing
existing remotes to keep up with changes made on the server side.

A proxy would avoid this complexity. It also allows limiting network
ingress to a single point.

Ideally a proxy would look like any other git-annex remote. All the files
stored anywhere in the cluster would be available to retrieve from the
proxy. When a file is sent to the proxy, it would store it somewhere in the
cluster.

Currently the closest git-annex can get to implementing such a proxy is a
transfer repository that wants all content that is not yet stored in the
cluster. This allows incoming transfers to be accepted and distributed to
nodes of the cluster. To get data back out of the cluster, there has to be
some communication that it is preferred content (eg, setting metadata);
then, after some delay while it is copied back to the transfer repository,
it becomes available for the client to download. And once the transfer
repository knows the client has its copy, the content can be removed from
it.

That is quite slow, and rather clumsy. And it risks the transfer repository
filling up with data that has been requested by clients that have not yet
picked it up, or with incoming transfers that have not yet reached the
cluster.

A proxy would not hold the content of files itself. It would probably be a
clone of the git repository, though. Uploads and downloads would stream
through the proxy. The git-annex [[P2P_protocol]] could be relayed in this
way.

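Such relaying amounts to a bidirectional byte stream between the client and
a chosen node. A minimal sketch in Python (illustrative only; git-annex is
Haskell, and the message bytes shown in the usage below are made up, not
the real P2P protocol framing):

```python
import socket
import threading

def pump(src, dst):
    """Copy bytes from src to dst until EOF, then close dst's write side,
    so EOF propagates through the proxy in that direction."""
    while True:
        data = src.recv(4096)
        if not data:
            break
        dst.sendall(data)
    try:
        dst.shutdown(socket.SHUT_WR)
    except OSError:
        pass

def relay(client, node):
    """Stream both directions between a client connection and a cluster
    node, so uploads and downloads pass through without being stored on
    the proxy itself."""
    t1 = threading.Thread(target=pump, args=(client, node))
    t2 = threading.Thread(target=pump, args=(node, client))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
```

In practice the proxy would also need to choose which node to relay a given
request to, and understand enough of the protocol to know when a transfer
ends.
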
## discovering UUIDs

A significant difficulty in implementing a proxy for the P2P protocol is
that each git-annex remote has a single UUID. But the remote that points at
the proxy can't just have the UUID of the proxy's repository; git-annex
needs to know that the remote can be used to access repositories with every
UUID in the cluster.

----

Could the P2P protocol be extended to let the proxy communicate the UUIDs
of all the repositories behind it?

Once the client git-annex knows the set of UUIDs behind the proxy, it can
instantiate a remote object per UUID, each of which accesses the proxy, but
with a different UUID.

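A sketch of that instantiation, in Python for illustration (git-annex
itself is Haskell; `ProxiedRemote` and the shape of the discovery result
are assumptions, not git-annex's actual Remote type):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProxiedRemote:
    """A remote object that talks to the proxy, but on behalf of one
    specific repository UUID behind it. (Hypothetical type, for
    illustration only.)"""
    proxy_address: str
    uuid: str

def instantiate_remotes(proxy_address, uuids):
    """One remote object per discovered UUID; all share the same proxy
    connection details and differ only in which UUID they address."""
    return [ProxiedRemote(proxy_address, u) for u in uuids]
```

Each of these objects would then be usable anywhere git-annex expects a
remote, with transfers to any of them going over the one connection to the
proxy.
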
But, git-annex usually only does UUID discovery the first time a ssh remote
is accessed. So it would need to discover at that point that the remote is
a proxy. Then it could do UUID discovery each time git-annex starts up.
But that adds significant overhead; git-annex would be making a connection
to the proxy in situations where it is not going to use it.

----

Could the proxy's set of UUIDs instead be recorded somewhere in the
git-annex branch?

With this approach, git-annex would know as soon as it sees the proxy's
UUID that this is a proxy for this other set of UUIDs. (Unless its
git-annex branch is not up-to-date.) And then it can instantiate a remote
for each UUID.

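The git-annex branch records facts as timestamped log lines, so one
plausible shape is a `proxy.log` mapping a proxy's UUID to the UUIDs behind
it, where a newer line for the same proxy fully replaces an older one. A
sketch of parsing such a log (the file name and layout here are
hypothetical, not an actual git-annex on-disk format):

```python
def parse_proxy_log(text):
    """Parse hypothetical log lines of the form:
        <timestamp>s <proxy-uuid> <node-uuid> <node-uuid> ...
    keeping, per proxy UUID, only the set from the newest timestamp,
    so removing a node behind the proxy is just logging a new line."""
    newest = {}  # proxy uuid -> (timestamp, set of node uuids)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[0].endswith("s"):
            continue  # skip malformed lines
        ts = float(fields[0][:-1])
        proxy, nodes = fields[1], set(fields[2:])
        if proxy not in newest or ts > newest[proxy][0]:
            newest[proxy] = (ts, nodes)
    return {proxy: nodes for proxy, (ts, nodes) in newest.items()}
```

Union merge of the git-annex branch would keep lines from both sides of a
merge, and the timestamp comparison picks the winner, as other git-annex
branch logs already do.
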
One difficulty with this is that, when the git-annex branch is not up to
date with changes from the proxy, git-annex may try to access repositories
that are no longer available behind the proxy. That failure would be
handled the same as any other currently unavailable repository. Also,
git-annex would not use the full set of repositories, so might not be able
to store data when, eg, all the repositories that it knows about are full.
Just getting the git-annex branch back in sync should recover from either
situation.

## streaming to special remotes

As well as being an intermediary to git-annex repositories, the proxy could
provide access to other special remotes. That could be an object store like
S3, which might be internal to the cluster or not. When using a cloud
service like S3, only the proxy needs to know the access credentials.

Currently git-annex does not support streaming content to special remotes.
The remote interface operates on object files stored on disk. See
[[todo/transitive_transfers]] for discussion of that problem. If proxies
get implemented, that problem should be revisited.

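To make the gap concrete, here is a rough contrast of the two interface
shapes (Python for illustration; both class and method names are invented,
not git-annex's actual remote interface): today's model needs the complete
object file on disk before storing, while a streaming interface could let
the proxy pass chunks through to the special remote as they arrive from the
client.

```python
import hashlib

class FileBasedRemote:
    """Today's model: the content must already exist as a complete
    object file on the proxy's disk before the store operation runs."""
    def __init__(self):
        self.objects = {}

    def store(self, key, path):
        with open(path, "rb") as f:
            self.objects[key] = f.read()

class StreamingRemote:
    """Hypothetical model: chunks are fed through as they arrive, so
    nothing needs to touch the proxy's disk; the content can be
    verified incrementally as it streams."""
    def __init__(self):
        self.objects = {}

    def store_stream(self, key, chunks):
        h = hashlib.sha256()
        buf = bytearray()
        for chunk in chunks:
            h.update(chunk)
            buf.extend(chunk)
        self.objects[key] = bytes(buf)
        return h.hexdigest()
```

With only the file-based interface, a proxy in front of S3 would have to
spool each upload to its own disk first, which reintroduces the
transfer-repository problems described above.
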