2024-03-13 14:19:10 +00:00
|
|
|
When [[balanced_preferred_content]] is used, there may be many repositories
|
|
|
|
in a location -- either a server or a cluster -- and getting any given file
|
|
|
|
may need to access any of them. Configuring remotes for each repository
|
|
|
|
adds a lot of complexity, both in setting up access controls on each
|
|
|
|
server, and for the user.
|
|
|
|
|
|
|
|
Particularly on the user side, when ssh is used they may have to deal with
|
|
|
|
many different ssh host keys, as well as adding new remotes or removing
|
|
|
|
existing remotes to keep up with changes are made on the server side.
|
|
|
|
|
|
|
|
A proxy would avoid this complexity. It also allows limiting network
|
|
|
|
ingress to a single point.
|
|
|
|
|
|
|
|
Ideally a proxy would look like any other git-annex remote. All the files
|
|
|
|
stored anywhere in the cluster would be available to retrieve from the
|
|
|
|
proxy. When a file is sent to the proxy, it would store it somewhere in the
|
|
|
|
cluster.
|
|
|
|
|
|
|
|
Currently the closest git-annex can get to implementing such a proxy is a
|
|
|
|
transfer repository that wants all content that is not yet stored in the
|
|
|
|
cluster. This allows incoming transfers to be accepted and distributed to
|
|
|
|
nodes of the cluster. To get data back out of the cluster, there has to be
|
|
|
|
some communication that it is preferred content (eg, setting metadata),
|
|
|
|
then after some delay for it to be copied back to the transfer repository,
|
|
|
|
it becomes available for the client to download it. And once it knows the
|
|
|
|
client has its copy, it can be removed from the transfer repository.
|
|
|
|
|
|
|
|
That is quite slow, and rather clumsy. And it risks the transfer repository
|
|
|
|
filling up with data that has been requested by clients that have not yet
|
|
|
|
picked it up, or with incoming transfers that have not yet reached the
|
|
|
|
cluster.
|
|
|
|
|
|
|
|
A proxy would not hold the content of files itself. It would be a clone of
|
|
|
|
the git repository though, probably. Uploads and downloads would stream
|
|
|
|
through the proxy. The git-annex [[P2P_protocol]] could be relayed in this way.
|
|
|
|
|
2024-03-13 14:32:03 +00:00
|
|
|
## UUID discovery
|
2024-03-13 14:19:10 +00:00
|
|
|
|
2024-03-13 14:32:03 +00:00
|
|
|
A significant difficulty in implementing a proxy is that each git-annex
|
|
|
|
remote has a single UUID. But the remote that points at the proxy can't
|
|
|
|
just have the UUID of the proxy's repository, git-annex needs to know that
|
|
|
|
the proxy's remote can be used to access repositories with every UUID in
|
|
|
|
the cluster.
|
2024-03-13 14:19:10 +00:00
|
|
|
|
2024-03-13 14:32:03 +00:00
|
|
|
### UUID discovery via P2P protocol extension
|
2024-03-13 14:19:10 +00:00
|
|
|
|
|
|
|
Could the P2P protocol be extended to let the proxy communicate the UUIDs
|
|
|
|
of all the repositories behind it?
|
|
|
|
|
|
|
|
Once the client git-annex knows the set of UUIDs behind the proxy, it can
|
|
|
|
instantiate a remote object per uuid, each of which accesses the proxy, but
|
|
|
|
with a different UUID.
|
|
|
|
|
|
|
|
But, git-annx usually only does UUID discovery the first time a ssh remote
|
|
|
|
is accessed. So it would need to discover at that point that the remote is
|
|
|
|
a proxy. Then it could do UUID discovery each time git-annex starts up.
|
|
|
|
But that adds significant overhead, git-annex would be making a connection
|
|
|
|
to the proxy in situations where it is not going to use it.
|
|
|
|
|
2024-03-13 14:32:03 +00:00
|
|
|
### UUID discovery via git-annex branch
|
2024-03-13 14:19:10 +00:00
|
|
|
|
|
|
|
Could the proxy's set of UUIDs instead be recorded somewhere in the
|
|
|
|
git-annex branch?
|
|
|
|
|
|
|
|
With this approach, git-annex would know as soon as it sees the proxy's
|
|
|
|
UUID that this is a proxy for this other set of UUIDS. (Unless its
|
|
|
|
git-annex branch is not up-to-date.) And then it can instantiate a UUID for
|
|
|
|
each remote.
|
|
|
|
|
|
|
|
One difficulty with this is that, when the git-annex branch is not up to
|
|
|
|
date with changes from the proxy, git-annex may try to access repositories
|
|
|
|
that are no longer available behind the proxy. That failure would be
|
|
|
|
handled the same as any other currently unavailable repository. Also
|
|
|
|
git-annex would not use the full set of repositories, so might not be able
|
|
|
|
to store data when eg, all the repositories that is knows about are full.
|
|
|
|
Just getting the git-annex back in sync should recover from either
|
|
|
|
situation.
|
|
|
|
|
|
|
|
## streaming to special remotes
|
|
|
|
|
|
|
|
As well as being an intermediary to git-annex repositories, the proxy could
|
|
|
|
provide access to other special remotes. That could be an object store like
|
|
|
|
S3, which might be internal to the cluster or not. When using a cloud
|
|
|
|
service like S3, only the proxy needs to know the access credentials.
|
|
|
|
|
|
|
|
Currently git-annex does not support streaming content to special remotes.
|
|
|
|
The remote interface operates on object files stored on disk. See
|
|
|
|
[[todo/transitive_transfers]] for discussion of that problem. If proxies
|
|
|
|
get implemented, that problem should be revisited.
|
|
|
|
|
2024-03-13 14:29:48 +00:00
|
|
|
## speed
|
|
|
|
|
|
|
|
A passthrough proxy should be as fast as possible so as not to add overhead
|
|
|
|
to a file retrieve, store, or checkpresent. This probably means that
|
|
|
|
it keeps TCP connections open to each host in the cluster. It might use a
|
|
|
|
protocol with less overhead than ssh.
|
|
|
|
|
|
|
|
In the case of checkpresent, it would be possible for the proxy to not
|
|
|
|
communicate with the cluster to check that the data is still present on it.
|
|
|
|
As long as all access is intermediated via the proxy, its git-annex branch
|
|
|
|
could be relied on to always be correct, in theory. Proving that theory,
|
|
|
|
making sure to account for all possible race conditions and other scenarios,
|
|
|
|
would be necessary for such an optimisation.
|
|
|
|
|
|
|
|
Another way the proxy could speed things up is to cache some subset of
|
|
|
|
content. Eg, analize what files are typically requested, and store another
|
|
|
|
copy of those on the proxy. Perhaps prioritize storing smaller files, where
|
|
|
|
latency tends to swamp transfer speed.
|
|
|
|
|