git-annex/doc/design/passthrough_proxy.mdwn

When [[balanced_preferred_content]] is used, there may be many repositories
in a location -- either a server or a cluster -- and getting any given file
may need to access any of them. Configuring remotes for each repository
adds a lot of complexity, both in setting up access controls on each
server, and for the user. 

Particularly on the user side, when ssh is used they may have to deal with
many different ssh host keys, as well as adding new remotes or removing
existing remotes to keep up with changes are made on the server side.

A proxy would avoid this complexity. It also allows limiting network
ingress to a single point.

Ideally a proxy would look like any other git-annex remote. All the files
stored anywhere in the cluster would be available to retrieve from the
proxy. When a file is sent to the proxy, it would store it somewhere in the
cluster.

Currently the closest git-annex can get to implementing such a proxy is a
transfer repository that wants all content that is not yet stored in the
cluster. This allows incoming transfers to be accepted and distributed to
nodes of the cluster. To get data back out of the cluster, there has to be
some communication that it is preferred content (eg, setting metadata),
then after some delay for it to be copied back to the transfer repository,
it becomes available for the client to download it. And once it knows the
client has its copy, it can be removed from the transfer repository.

That is quite slow, and rather clumsy. And it risks the transfer repository
filling up with data that has been requested by clients that have not yet
picked it up, or with incoming transfers that have not yet reached the
cluster.

A proxy would not hold the content of files itself. It would be a clone of
the git repository though, probably. Uploads and downloads would stream
through the proxy. The git-annex [[P2P_protocol]] could be relayed in this way. 

## UUID discovery

A significant difficulty in implementing a proxy is that each git-annex
remote has a single UUID. But the remote that points at the proxy can't
just have the UUID of the proxy's repository, git-annex needs to know that
the proxy's remote can be used to access repositories with every UUID in
the cluster.

### UUID discovery via P2P protocol extension

Could the P2P protocol be extended to let the proxy communicate the UUIDs
of all the repositories behind it?

Once the client git-annex knows the set of UUIDs behind the proxy, it could
eg instantiate a remote object per UUID, each of which accesses the proxy, but
with a different UUID.

But, git-annx usually only does UUID discovery the first time a ssh remote
is accessed. So it would need to discover at that point that the remote is
a proxy. Then it could do UUID discovery each time git-annex starts up.
But that adds significant overhead, git-annex would be making a connection
to the proxy in situations where it is not going to use it.

### UUID discovery via git-annex branch

Could the proxy's set of UUIDs instead be recorded somewhere in the
git-annex branch?

With this approach, git-annex would know as soon as it sees the proxy's
UUID that this is a proxy for this other set of UUIDS. (Unless its
git-annex branch is not up-to-date.)

One difficulty with this is that, when the git-annex branch is not up to
date with changes from the proxy, git-annex may try to access repositories
that are no longer available behind the proxy. That failure would be
handled the same as any other currently unavailable repository. Also
git-annex would not use the full set of repositories, so might not be able
to store data when eg, all the repositories that is knows about are full.
Just getting the git-annex back in sync should recover from either
situation.

## user interface

What to name the instantiated remotes? Probably the best that could
be done is to use the proxy's own remote names as suffixes on the client.
Eg, the proxy's "node1" remote is "proxy-node1".

But the user probably doesn't want to pick which node to send content to.
They don't necessarily know anything about the nodes. Ideally the user
would `git-annex copy --to proxy` or `git-annex push` and let it pick
which instantiated remote(s) to send to.

To make `git-annex copy --to proxy` work, `storeKey` could be changed to
allow returning a UUID (or UUIDs) where the content was actually stored.
That would also allow a single upload to the proxy to fan out and be stored
in multiple nodes. The proxy would use preferred content to pick which of
its nodes to store on.

Instantiated remotes would still be needed for `git-annex get` and similar
to work.

To make `git-annex copy --from proxy` work, the proxy would need to pick
a node and stream content from it. That's doable, but how to handle a case
where a node gets corrupted? The best it could do is mark that node as no
longer containing the content (as if a fsck failed) and try another one
next time. This complication might not be necessary. Consider that
while `git-annex copy --to foo` followed later by `git-annex copy --from foo`
will usually work, it doesn't work when eg first copying to a transfer
remote, which then sends the content elsewhere and drops its copy.

What about dropping? `git-annex drop --from proxy` could be made to work,
by having `removeKey` return a list of UUIDs that the content was dropped
from. What should that do if it's able to drop from some nodes but not
others? Perhaps it would need to be able to return a list of UUIDs that
content was dropped from but still indicate it overall failed to drop.
(Note that it's entirely possible that dropping from one node of the proxy
involves lockContent on another node of the proxy in order to satisfy
numcopies.)

A command like `git-annex push` would see all the instantiated remotes and
would pick one to send content to. Seems like the proxy might choose to
`storeKey` the content on other node(s) than the requested one. Which would
be fine. But, `git-annex push` would still do considerable extra work in
interating over all the instantiated remotes. So it might be better to make
such commands not operate on instantiated remotes for sending content but
only on the proxy. 

Commands like `git-annex push` and `git-annex pull`
should also skip the instantiated remotes when pushing or pulling the git
repo, because that would be extra work that accomplishes nothing.

## speed

A passthrough proxy should be as fast as possible so as not to add overhead
to a file retrieve, store, or checkpresent. This probably means that
it keeps TCP connections open to each host in the cluster. It might use a
protocol with less overhead than ssh.

In the case of checkpresent, it would be possible for the proxy to not
communicate with the cluster to check that the data is still present on it.
As long as all access is intermediated via the proxy, its git-annex branch
could be relied on to always be correct, in theory. Proving that theory,
making sure to account for all possible race conditions and other scenarios,
would be necessary for such an optimisation.

Another way the proxy could speed things up is to cache some subset of
content. Eg, analize what files are typically requested, and store another
copy of those on the proxy. Perhaps prioritize storing smaller files, where
latency tends to swamp transfer speed.

## streaming to special remotes

As well as being an intermediary to git-annex repositories, the proxy could
provide access to other special remotes. That could be an object store like
S3, which might be internal to the cluster or not. When using a cloud
service like S3, only the proxy needs to know the access credentials.

Currently git-annex does not support streaming content to special remotes.
The remote interface operates on object files stored on disk. See
[[todo/transitive_transfers]] for discussion of that problem. If proxies
get implemented, that problem should be revisited.

## encryption

When the proxy is in front of a special remote that uses encryption, where
does the encryption happen? It could either happen on the client before
sending to the proxy, or the proxy could do the encryption since it
communicates with the special remote. For security, doing the encryption on
the client seems like the best choice by far.

But, git-annex's git remotes don't currently ever do encryption. And
special remotes don't communicate via the P2P protocol with a git remote.
So none of git-annex's existing remote implementations would be able to handle
this case. Something will need to be changed in the remote
implementation for this.

(Chunking has the same problem.)

There's potentially a layering problem here, because exactly how encryption
(or chunking) works can vary depending on the type of special remote.

## exporttree=yes

Could the proxy be in front of a special remote that uses exporttree=yes?

Some possible approaches:

* Proxy caches files until all the files in the configured
  annex-tracking-branch are available, then exports them all to the special
  remote. Not ideal at all.
* Proxy exports each file to the special remote as it is received.
  It records an incomplete tree export after each export.
  Once all files in the configured annex-tracking-branch have been sent,
  it records a completed tree export. This seems possible, it's similar
  to `git-annex export --to=remote` recovering after having been
  interrupted.
* Proxy storeExport and all related export/import actions. This would need
  a large expansion of the P2P protocol.

The first two approaches need some way to communicate the
configured annex-tracking-branch over the P2P protocol. Or to communicate
the tree that it currently points to.

The first two approaches also have a complication when a key is sent to
the proxy that is not part of the configured annex-tracking-branch. What
does the proxy do with it?
todo 2024-03-13 14:19:10 +00:00			`When [[balanced_preferred_content]] is used, there may be many repositories`
			`in a location -- either a server or a cluster -- and getting any given file`
			`may need to access any of them. Configuring remotes for each repository`
			`adds a lot of complexity, both in setting up access controls on each`
			`server, and for the user.`

			`Particularly on the user side, when ssh is used they may have to deal with`
			`many different ssh host keys, as well as adding new remotes or removing`
			`existing remotes to keep up with changes are made on the server side.`

			`A proxy would avoid this complexity. It also allows limiting network`
			`ingress to a single point.`

			`Ideally a proxy would look like any other git-annex remote. All the files`
			`stored anywhere in the cluster would be available to retrieve from the`
			`proxy. When a file is sent to the proxy, it would store it somewhere in the`
			`cluster.`

			`Currently the closest git-annex can get to implementing such a proxy is a`
			`transfer repository that wants all content that is not yet stored in the`
			`cluster. This allows incoming transfers to be accepted and distributed to`
			`nodes of the cluster. To get data back out of the cluster, there has to be`
			`some communication that it is preferred content (eg, setting metadata),`
			`then after some delay for it to be copied back to the transfer repository,`
			`it becomes available for the client to download it. And once it knows the`
			`client has its copy, it can be removed from the transfer repository.`

			`That is quite slow, and rather clumsy. And it risks the transfer repository`
			`filling up with data that has been requested by clients that have not yet`
			`picked it up, or with incoming transfers that have not yet reached the`
			`cluster.`

			`A proxy would not hold the content of files itself. It would be a clone of`
			`the git repository though, probably. Uploads and downloads would stream`
			`through the proxy. The git-annex [[P2P_protocol]] could be relayed in this way.`

update 2024-03-13 14:32:03 +00:00			`## UUID discovery`
todo 2024-03-13 14:19:10 +00:00
update 2024-03-13 14:32:03 +00:00			`A significant difficulty in implementing a proxy is that each git-annex`
			`remote has a single UUID. But the remote that points at the proxy can't`
			`just have the UUID of the proxy's repository, git-annex needs to know that`
			`the proxy's remote can be used to access repositories with every UUID in`
			`the cluster.`
todo 2024-03-13 14:19:10 +00:00
update 2024-03-13 14:32:03 +00:00			`### UUID discovery via P2P protocol extension`
todo 2024-03-13 14:19:10 +00:00
			`Could the P2P protocol be extended to let the proxy communicate the UUIDs`
			`of all the repositories behind it?`

additional design work on proxies Sponsored-by: Dartmouth College's OpenNeuro project 2024-05-01 15:08:10 +00:00			`Once the client git-annex knows the set of UUIDs behind the proxy, it could`
			`eg instantiate a remote object per UUID, each of which accesses the proxy, but`
todo 2024-03-13 14:19:10 +00:00			`with a different UUID.`

			`But, git-annx usually only does UUID discovery the first time a ssh remote`
			`is accessed. So it would need to discover at that point that the remote is`
			`a proxy. Then it could do UUID discovery each time git-annex starts up.`
			`But that adds significant overhead, git-annex would be making a connection`
			`to the proxy in situations where it is not going to use it.`

update 2024-03-13 14:32:03 +00:00			`### UUID discovery via git-annex branch`
todo 2024-03-13 14:19:10 +00:00
			`Could the proxy's set of UUIDs instead be recorded somewhere in the`
			`git-annex branch?`

			`With this approach, git-annex would know as soon as it sees the proxy's`
			`UUID that this is a proxy for this other set of UUIDS. (Unless its`
additional design work on proxies Sponsored-by: Dartmouth College's OpenNeuro project 2024-05-01 15:08:10 +00:00			`git-annex branch is not up-to-date.)`
todo 2024-03-13 14:19:10 +00:00
			`One difficulty with this is that, when the git-annex branch is not up to`
			`date with changes from the proxy, git-annex may try to access repositories`
			`that are no longer available behind the proxy. That failure would be`
			`handled the same as any other currently unavailable repository. Also`
			`git-annex would not use the full set of repositories, so might not be able`
			`to store data when eg, all the repositories that is knows about are full.`
			`Just getting the git-annex back in sync should recover from either`
			`situation.`

additional design work on proxies Sponsored-by: Dartmouth College's OpenNeuro project 2024-05-01 15:08:10 +00:00			`## user interface`

			`What to name the instantiated remotes? Probably the best that could`
			`be done is to use the proxy's own remote names as suffixes on the client.`
			`Eg, the proxy's "node1" remote is "proxy-node1".`

			`But the user probably doesn't want to pick which node to send content to.`
			`They don't necessarily know anything about the nodes. Ideally the user`
			would `git-annex copy --to proxy` or `git-annex push` and let it pick
			`which instantiated remote(s) to send to.`

			To make `git-annex copy --to proxy` work, `storeKey` could be changed to
			`allow returning a UUID (or UUIDs) where the content was actually stored.`
			`That would also allow a single upload to the proxy to fan out and be stored`
			`in multiple nodes. The proxy would use preferred content to pick which of`
			`its nodes to store on.`

			Instantiated remotes would still be needed for `git-annex get` and similar
			`to work.`

			To make `git-annex copy --from proxy` work, the proxy would need to pick
			`a node and stream content from it. That's doable, but how to handle a case`
			`where a node gets corrupted? The best it could do is mark that node as no`
			`longer containing the content (as if a fsck failed) and try another one`
			`next time. This complication might not be necessary. Consider that`
			while `git-annex copy --to foo` followed later by `git-annex copy --from foo`
			`will usually work, it doesn't work when eg first copying to a transfer`
			`remote, which then sends the content elsewhere and drops its copy.`

			What about dropping? `git-annex drop --from proxy` could be made to work,
			by having `removeKey` return a list of UUIDs that the content was dropped
			`from. What should that do if it's able to drop from some nodes but not`
			`others? Perhaps it would need to be able to return a list of UUIDs that`
			`content was dropped from but still indicate it overall failed to drop.`
			`(Note that it's entirely possible that dropping from one node of the proxy`
			`involves lockContent on another node of the proxy in order to satisfy`
			`numcopies.)`

			A command like `git-annex push` would see all the instantiated remotes and
			`would pick one to send content to. Seems like the proxy might choose to`
			`storeKey` the content on other node(s) than the requested one. Which would
			be fine. But, `git-annex push` would still do considerable extra work in
			`interating over all the instantiated remotes. So it might be better to make`
			`such commands not operate on instantiated remotes for sending content but`
			`only on the proxy.`

			Commands like `git-annex push` and `git-annex pull`
			`should also skip the instantiated remotes when pushing or pulling the git`
			`repo, because that would be extra work that accomplishes nothing.`

update 2024-03-13 14:29:48 +00:00			`## speed`

			`A passthrough proxy should be as fast as possible so as not to add overhead`
			`to a file retrieve, store, or checkpresent. This probably means that`
			`it keeps TCP connections open to each host in the cluster. It might use a`
			`protocol with less overhead than ssh.`

			`In the case of checkpresent, it would be possible for the proxy to not`
			`communicate with the cluster to check that the data is still present on it.`
			`As long as all access is intermediated via the proxy, its git-annex branch`
			`could be relied on to always be correct, in theory. Proving that theory,`
			`making sure to account for all possible race conditions and other scenarios,`
			`would be necessary for such an optimisation.`

			`Another way the proxy could speed things up is to cache some subset of`
			`content. Eg, analize what files are typically requested, and store another`
			`copy of those on the proxy. Perhaps prioritize storing smaller files, where`
			`latency tends to swamp transfer speed.`

design work on proxies for exporttree=yes Sponsored-by: Dartmouth College's OpenNeuro project 2024-05-01 16:07:57 +00:00			`## streaming to special remotes`

			`As well as being an intermediary to git-annex repositories, the proxy could`
			`provide access to other special remotes. That could be an object store like`
			`S3, which might be internal to the cluster or not. When using a cloud`
			`service like S3, only the proxy needs to know the access credentials.`

			`Currently git-annex does not support streaming content to special remotes.`
			`The remote interface operates on object files stored on disk. See`
			`[[todo/transitive_transfers]] for discussion of that problem. If proxies`
			`get implemented, that problem should be revisited.`

update 2024-03-13 15:19:04 +00:00			`## encryption`

			`When the proxy is in front of a special remote that uses encryption, where`
			`does the encryption happen? It could either happen on the client before`
			`sending to the proxy, or the proxy could do the encryption since it`
			`communicates with the special remote. For security, doing the encryption on`
			`the client seems like the best choice by far.`

			`But, git-annex's git remotes don't currently ever do encryption. And`
			`special remotes don't communicate via the P2P protocol with a git remote.`
update 2024-03-13 15:21:05 +00:00			`So none of git-annex's existing remote implementations would be able to handle`
			`this case. Something will need to be changed in the remote`
			`implementation for this.`
update 2024-03-13 15:19:04 +00:00
			`(Chunking has the same problem.)`
update 2024-03-13 15:21:05 +00:00
			`There's potentially a layering problem here, because exactly how encryption`
			`(or chunking) works can vary depending on the type of special remote.`
design work on proxies for exporttree=yes Sponsored-by: Dartmouth College's OpenNeuro project 2024-05-01 16:07:57 +00:00
			`## exporttree=yes`

			`Could the proxy be in front of a special remote that uses exporttree=yes?`

			`Some possible approaches:`

			`* Proxy caches files until all the files in the configured`
			`annex-tracking-branch are available, then exports them all to the special`
			`remote. Not ideal at all.`
			`* Proxy exports each file to the special remote as it is received.`
			`It records an incomplete tree export after each export.`
			`Once all files in the configured annex-tracking-branch have been sent,`
			`it records a completed tree export. This seems possible, it's similar`
			to `git-annex export --to=remote` recovering after having been
			`interrupted.`
			`* Proxy storeExport and all related export/import actions. This would need`
			`a large expansion of the P2P protocol.`

			`The first two approaches need some way to communicate the`
			`configured annex-tracking-branch over the P2P protocol. Or to communicate`
			`the tree that it currently points to.`

			`The first two approaches also have a complication when a key is sent to`
			`the proxy that is not part of the configured annex-tracking-branch. What`
			`does the proxy do with it?`