started on a design for P2P protocol over HTTP

Added to git-annex_proxies todo because this is something OpenNeuro
would need in order to use the git-annex proxy.

Sponsored-by: Dartmouth College's OpenNeuro project
Joey Hess 2024-05-01 15:26:51 -04:00
parent d28adebd6b
commit cbaf2172ab
4 changed files with 139 additions and 5 deletions

@ -72,7 +72,7 @@ on its own line, followed by a newline and the binary data.
The Len value tells how many bytes of data to read.
DATA 3
foo1
foo
Note that there is no newline after the binary data; the next protocol
message will come immediately after it.
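The framing rule above can be sketched as a small reader. This is a minimal illustration in Python, not git-annex's own code; the function name `read_data_message` is hypothetical, and the stream contents are made up to match the `DATA 3` example.

```python
import io

def read_data_message(stream):
    """Read one 'DATA Len' message from a binary stream and return the payload.

    Hypothetical helper for illustration only.
    """
    header = stream.readline()  # e.g. b"DATA 3\n"
    keyword, _, length = header.rstrip(b"\n").partition(b" ")
    if keyword != b"DATA":
        raise ValueError(f"expected DATA, got {header!r}")
    size = int(length.decode("ascii"))
    payload = stream.read(size)  # exactly Len bytes, no trailing newline
    if len(payload) != size:
        raise EOFError("stream ended before all data was read")
    return payload

# The next protocol message starts right after the binary data.
stream = io.BytesIO(b"DATA 3\nfooVALID\n")
payload = read_data_message(stream)
next_line = stream.readline()
```

Because the reader consumes exactly `Len` bytes, the following message (`VALID` here) is read intact even though no newline separates it from the data.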

@ -0,0 +1,125 @@
[[!toc ]]
## motivation
The [[P2P protocol]] is a custom protocol that git-annex speaks over an ssh
connection (mostly). This is a design for supporting the P2P
protocol over HTTP.
Upload of annex objects to git remotes that use http is currently not
supported by git-annex, and this would be a generally very useful addition.
For use cases such as OpenNeuro's JavaScript client, ssh is too difficult
to support, so they currently use a special remote that talks to an HTTP
endpoint in order to upload objects. Implementing this would let them
talk to git-annex over HTTP directly.
With the [[passthrough_proxy]], this would let clients configure a single
http remote that accesses a more complicated network of git-annex
repositories.
## approach 1: encapsulation
One approach is to encapsulate the P2P protocol inside HTTP. This has the
benefit of being simple to think about. It is not very web-native though.
There would be a single API endpoint. The client connects and sends a
request that encapsulates one or more lines in the P2P protocol. The server
sends a response that encapsulates one or more lines in the P2P
protocol.
For example (eliding the full HTTP responses, only showing the data):
> POST /git-annex HTTP/1.0
> Content-Type: x-git-annex-p2p
> Content-Length: ...
>
> AUTH 79a5a1f4-07e8-11ef-873d-97f93ca91925
< AUTH_SUCCESS ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6
> POST /git-annex HTTP/1.0
> Content-Type: x-git-annex-p2p
> Content-Length: ...
>
> VERSION 1
< VERSION 1
> POST /git-annex HTTP/1.0
> Content-Type: x-git-annex-p2p
> Content-Length: ...
>
> CHECKPRESENT SHA1--foo
< SUCCESS
> POST /git-annex HTTP/1.0
> Content-Type: x-git-annex-p2p
> Content-Length: ...
>
> PUT bar SHA1--bar
< PUT-FROM 0
> POST /git-annex HTTP/1.0
> Content-Type: x-git-annex-p2p
> Content-Length: ...
>
> DATA 3
> foo
> VALID
< SUCCESS
Note that, since VERSION is negotiated in one request, the HTTP server
needs to know that a series of requests is part of the same P2P protocol
session. In the example above, it would not have a good way to do that.
One solution would be to add a session identifier UUID to each request.
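To make the state problem concrete, here is a minimal sketch of what the server would have to keep if each request carried a session identifier. Everything in it is an assumption for illustration: the class name `SessionTable`, its methods, and the idea that the identifier arrives alongside each P2P line. The design above does not say how the identifier would be transmitted.

```python
import uuid

class SessionTable:
    """Hypothetical server-side table: session UUID -> negotiated version."""

    def __init__(self):
        self._versions = {}

    def new_session(self):
        session_id = str(uuid.uuid4())
        self._versions[session_id] = None  # nothing negotiated yet
        return session_id

    def handle_line(self, session_id, line):
        """Record VERSION lines; return the version stored for this session."""
        if session_id not in self._versions:
            raise KeyError(f"unknown session {session_id}")
        words = line.split()
        if words[0] == "VERSION":
            self._versions[session_id] = int(words[1])
        return self._versions[session_id]

table = SessionTable()
a = table.new_session()
b = table.new_session()
table.handle_line(a, "VERSION 1")
# A later request in session a still sees the negotiated version;
# session b, which never negotiated, does not.
version_a = table.handle_line(a, "CHECKPRESENT SHA1--foo")
version_b = table.handle_line(b, "CHECKPRESENT SHA1--foo")
```

The table is exactly the per-session state that makes the server stateful under this approach.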
## approach 2: HTTP API
Another approach is to define a web-native API with endpoints that
correspond to each action in the P2P protocol.
Something like this:
> GET /git-annex/v1/AUTH?clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925 HTTP/1.0
< AUTH_SUCCESS ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6
> GET /git-annex/v1/CHECKPRESENT?key=SHA1--foo&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.0
< SUCCESS
> GET /git-annex/v1/PUT-FROM?key=SHA1--foo&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.0
< PUT-FROM 0
> POST /git-annex/v1/PUT?key=SHA1--foo&associatedfile=bar&put-from=0&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.0
> Content-Type: application/octet-stream
> Content-Length: 4
> foo1
< SUCCESS
(In the last example above, "foo" is the content, and there is an additional
byte at the end that is 1 for VALID and 0 for INVALID. This seems better than
needing an entire other request to indicate validity.)
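A minimal sketch of the trailing validity byte, assuming (as the `foo1` example with `Content-Length: 4` suggests) that the byte is the ASCII character `1` or `0`. The function names are hypothetical.

```python
def encode_put_body(content: bytes, valid: bool) -> bytes:
    """Append the validity byte: ASCII '1' for VALID, '0' for INVALID."""
    return content + (b"1" if valid else b"0")

def decode_put_body(body: bytes):
    """Split a request body into (content, valid)."""
    if not body:
        raise ValueError("body must contain at least the validity byte")
    content, flag = body[:-1], body[-1:]
    if flag not in (b"0", b"1"):
        raise ValueError(f"unexpected validity byte {flag!r}")
    return content, flag == b"1"

body = encode_put_body(b"foo", valid=True)
decoded = decode_put_body(body)
```

Note that the receiver has to rely on `Content-Length` to find the last byte, since the content itself may legitimately end in `0` or `1`.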
This needs a more complex spec. But it's easier for others to implement,
especially since it does not need a session identifier, so the HTTP server can
be stateless.
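Since every request carries both UUIDs, a client can build each URL independently of any earlier request. A minimal sketch, using the endpoint layout from the examples above; the helper name `endpoint_url` is hypothetical.

```python
from urllib.parse import urlencode

def endpoint_url(action, params):
    """Build the path and query string for one P2P action (hypothetical helper)."""
    return f"/git-annex/v1/{action}?{urlencode(params)}"

CLIENT = "79a5a1f4-07e8-11ef-873d-97f93ca91925"
SERVER = "ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6"

url = endpoint_url(
    "CHECKPRESENT",
    {"key": "SHA1--foo", "clientuuid": CLIENT, "serveruuid": SERVER},
)
```

Passing the parameters as a dict rather than keyword arguments allows names such as `put-from` that are not valid Python identifiers.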
## HTTP GET
It should be possible to support a regular HTTP GET of a key, with
no additional parameters, so that annex objects can be served to other clients
from this web server.
> GET /git-annex/key/SHA1--foo HTTP/1.0
< foo
This would be a special case, though, not used by git-annex itself, because
the P2P protocol's GET has the complications of offsets, of the server sending
VALID/INVALID after the content, and of needing to know the client's UUID in
order to update the location log.
## Problem: CONNECT
The CONNECT message allows both sides of the P2P protocol to send DATA
messages in any order. This seems difficult to encapsulate in HTTP.
This can probably be left unimplemented, since it is likely not needed for an
HTTP remote.

@ -36,7 +36,15 @@ cluster.
A proxy would not hold the content of files itself. It would be a clone of
the git repository though, probably. Uploads and downloads would stream
through the proxy. The git-annex [[P2P_protocol]] could be relayed in this way.
through the proxy.
## protocol
The git-annex [[P2P_protocol]] would be relayed via the proxy,
which would be a regular git ssh remote.
There is also the possibility of relaying the P2P protocol over another
protocol such as HTTP, see [[P2P_protocol_over_http]].
## UUID discovery

@ -3,8 +3,9 @@ git-annex to be able to use proxies which sit in front of a cluster of
repositories.
1. [[design/passthrough_proxy]]
2. [[design/balanced_preferred_content]]
3. [[todo/track_free_space_in_repos_via_git-annex_branch]]
4. [[todo/proving_preferred_content_behavior]]
2. [[design/p2p_protocol_over_http]]
3. [[design/balanced_preferred_content]]
4. [[todo/track_free_space_in_repos_via_git-annex_branch]]
5. [[todo/proving_preferred_content_behavior]]
[[!tag projects/openneuro]]