git-annex/doc/design/p2p_protocol_over_http.mdwn
2024-07-05 10:08:43 -04:00

[[!toc ]]
## motivation
The [[P2P protocol]] is a custom protocol that git-annex speaks over an ssh
connection (mostly). This is a design for supporting the P2P protocol over
HTTP.

Upload of annex objects to git remotes that use http is currently not
supported by git-annex, and this would be a generally very useful addition.

For use cases such as OpenNeuro's javascript client, ssh is too difficult
to support, so they currently use a special remote that talks to an http
endpoint in order to upload objects. Implementing this would let them
talk to git-annex over http.

With the [[passthrough_proxy]], this would let clients configure a single
http remote that accesses a more complicated network of git-annex
repositories.

## integration with git
A webserver that is configured to serve a git repository either serves the
files in the repository with dumb http, or uses the git-http-backend CGI
program for url paths under eg `/git/`.

To integrate with that, git-annex would need a git-annex-http-backend CGI
program, that the webserver is configured to run for url paths under
`/git/.*/annex/`.

So, for a remote with a url `http://example.com/git/foo`, git-annex would
use paths under `http://example.com/git/foo/annex/` to run its CGI.

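
The path mapping described here can be sketched in a few lines of Python. This is only an illustration, assuming the webserver rule is the `/git/.*/annex/` pattern quoted above; the helper names `annex_base_url` and `is_annex_path` are made up for this sketch and are not part of git-annex.

```python
import re
from urllib.parse import urlparse

# Assumption: the webserver hands any url path matching this pattern
# to the proposed git-annex-http-backend CGI.
ANNEX_CGI_PATTERN = re.compile(r"^/git/.*/annex/")

def annex_base_url(remote_url: str) -> str:
    """Derive the annex url prefix from a git remote url (hypothetical helper)."""
    return remote_url.rstrip("/") + "/annex/"

def is_annex_path(url: str) -> bool:
    """Check whether the webserver rule would route this url to the CGI."""
    return ANNEX_CGI_PATTERN.match(urlparse(url).path) is not None

base = annex_base_url("http://example.com/git/foo")
print(base, is_annex_path(base))
```
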
But, the CGI interface is a poor match for the P2P protocol.

A particular problem is that `LOCKCONTENT` would need to be in one CGI
request, followed by another request to `UNLOCKCONTENT`. Unless
git-annex-http-backend forked a daemon to keep the content locked, it would
not be able to retain a file lock across the 2 requests. While the 10
minute retention lock would paper over that, `UNLOCKCONTENT` would not be
able to delete the retention lock, because there is no way to know if
another `LOCKCONTENT` was received later. So `LOCKCONTENT` would always lock
content for 10 minutes, which would result in some undesirable behaviors.

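
A toy model may make the retention lock problem more concrete. The sketch below is not how git-annex implements locking; it assumes a CGI that keeps no state between requests and only records, per key, a time until which the content must be retained. The class `StatelessLockModel` and its method names are invented for illustration.

```python
RETENTION_SECONDS = 10 * 60  # the 10 minute retention lock mentioned above

class StatelessLockModel:
    """Toy model of a CGI that keeps no per-client state between requests."""

    def __init__(self):
        self.retain_until = {}  # key -> time until which content must be kept

    def lockcontent(self, key, now):
        # Each LOCKCONTENT request (re)starts the retention period.
        expiry = now + RETENTION_SECONDS
        self.retain_until[key] = max(self.retain_until.get(key, 0), expiry)

    def unlockcontent(self, key, now):
        # The CGI cannot tell whether another LOCKCONTENT arrived after the
        # one being unlocked, so it has to leave the retention lock alone.
        pass

    def may_drop(self, key, now):
        return now >= self.retain_until.get(key, 0)

model = StatelessLockModel()
model.lockcontent("SHA1--foo", now=0)
model.unlockcontent("SHA1--foo", now=5)
print(model.may_drop("SHA1--foo", now=6))    # False: still retained after unlock
print(model.may_drop("SHA1--foo", now=600))  # True: retention period has expired
```
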
Another problem is with proxies and clusters. The CGI would need to open
ssh (or http) connections to the proxied repositories and cluster nodes
each time it is run. That would add a lot of latency to every request.

Running a git-annex process once per CGI request also means that git-annex's
own startup time, which is ok but not great, adds latency. And each time
the CGI needed to change the git-annex branch, it would have to commit on
shutdown. Lots of time and space optimisations would be prevented by using
the CGI interface.

So, rather than having the CGI program do anything in the repository
itself, have it pass each request through to a long-running server.
(This does have the downside that files would get double-copied
through the CGI, which adds some overhead.)

A reasonable way to do that would be to have a webserver speaking an
HTTP version of the git-annex P2P protocol and the CGI just talks to that.
The CGI program then becomes tiny, and just needs to know the url to
connect to the git-annex HTTP server.

Alternatively, a remote's configuration could include that url, and
then we don't need the complication and overhead of the CGI program at all.
Eg:

    git config remote.origin.annex-url http://example.com:8080/

So, the rest of this design will focus on implementing that. The CGI
program can be added later if desired, to avoid users needing to configure
an additional thing.

Note that one nice benefit of having a separate annex-url is that it allows
having remote.origin.url on eg github, but with an annex-url configured
that remote can also be used as a git-annex repository.

## approach 1: websockets
The client connects to the server over a websocket. From there on,
the protocol is encapsulated in websockets.

This seems nice and simple to implement, but not very web native. Anyone
wanting to talk to this web server would need to understand the P2P
protocol. Just uploading a file would mean dealing with AUTH,
AUTH-SUCCESS, AUTH-FAILURE, VERSION, PUT, ALREADY-HAVE, PUT-FROM, DATA,
INVALID, VALID, SUCCESS, and FAILURE messages. Seems like a lot.

Some requests like `LOCKCONTENT` do need full duplex communication like
websockets provide. But, it might be more web native to only use websockets
for that request, and not for everything.

## approach 2: web-native API
Another approach is to define a web-native API with endpoints that
correspond to each action in the P2P protocol.

Something like this:

    > POST /git-annex/v1/AUTH?clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925 HTTP/1.0
    < AUTH-SUCCESS ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6

    > POST /git-annex/v1/CHECKPRESENT?key=SHA1--foo&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.0
    < SUCCESS

    > POST /git-annex/v1/PUT-FROM?key=SHA1--foo&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.0
    < PUT-FROM 0

    > POST /git-annex/v1/PUT?key=SHA1--foo&associatedfile=bar&put-from=0&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.0
    > Content-Type: application/octet-stream
    > Content-Length: 20
    > foo
    > {"valid": true}
    < {"stored": true}

(In the last example above "foo" is the content; it is followed by a line of json.
This seems better than needing an entire other request to indicate validity.)

This needs a more complex spec. But it's easier for others to implement,
especially since it does not need a session identifier, so the HTTP server can
be stateless.

A full draft protocol for this is being developed at [[p2p_protocol_over_http/draft1]].

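
As a rough illustration of how a client might assemble these requests, here is a minimal Python sketch. It only builds the request path and the `PUT` body; it sends nothing over the network. The helper names `endpoint` and `put_body` are hypothetical, and the body layout (content, then one newline-terminated line of json) is an assumption read off the example above, not taken from the draft spec.

```python
import json
from urllib.parse import urlencode

API_PREFIX = "/git-annex/v1"  # prefix used in the example requests above

def endpoint(action: str, params: dict) -> str:
    """Build the request path and query string for one P2P action."""
    return f"{API_PREFIX}/{action}?{urlencode(params)}"

def put_body(content: bytes, valid: bool) -> bytes:
    """Object content followed by one line of json stating its validity."""
    return content + json.dumps({"valid": valid}).encode("ascii") + b"\n"

ids = {
    "clientuuid": "79a5a1f4-07e8-11ef-873d-97f93ca91925",
    "serveruuid": "ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6",
}
print(endpoint("CHECKPRESENT", {"key": "SHA1--foo", **ids}))

# Assumption: the content in the example is "foo" plus a trailing newline.
body = put_body(b"foo\n", valid=True)
print(len(body))
```

Under those assumptions the body comes to 20 bytes, which agrees with the `Content-Length` shown in the example. Since every request carries the client and server UUIDs in its parameters, nothing in this sketch needs a session.
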
## HTTP GET
It should be possible to support a regular HTTP GET of a key, with
no additional parameters, so that annex objects can be served to other clients
from this web server.

    > GET /git-annex/key/SHA1--foo HTTP/1.0
    < foo

This would be a special case, though, not used by git-annex, because the P2P
protocol's GET has the complication of offsets, and of the server sending
VALID/INVALID after the content, and of needing to know the client's UUID in
order to update the location log.

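
A plain GET like this is simple enough to sketch with Python's standard `http.server` module. The following is only an illustration under loud assumptions: the object store is a hypothetical in-memory dict named `OBJECTS`, the helper `key_from_path` is made up, and rejecting keys that contain `/` is a simplification of this sketch. It deliberately ignores offsets, VALID/INVALID and location log updates, as the special case described above would.

```python
from http.server import BaseHTTPRequestHandler
from urllib.parse import unquote

KEY_PREFIX = "/git-annex/key/"      # url prefix from the example above
OBJECTS = {"SHA1--foo": b"foo\n"}   # hypothetical in-memory object store

def key_from_path(path: str):
    """Extract the key from a request path, or return None if it is not a key url."""
    if not path.startswith(KEY_PREFIX):
        return None
    key = unquote(path[len(KEY_PREFIX):])
    return key if key and "/" not in key else None

class KeyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        key = key_from_path(self.path)
        content = OBJECTS.get(key) if key else None
        if content is None:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(content)))
        self.end_headers()
        self.wfile.write(content)

print(key_from_path("/git-annex/key/SHA1--foo"))
```

To try it out, the handler can be served with `http.server.HTTPServer(("localhost", 8080), KeyHandler).serve_forever()`; the port is arbitrary.
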
## Problem: CONNECT
The CONNECT message allows both sides of the P2P protocol to send DATA
messages in any order. This seems difficult to encapsulate in HTTP.

This can probably be left unimplemented; it's probably not needed for an HTTP
remote. CONNECT is used to tunnel the git protocol over the P2P protocol, but
for an HTTP remote the git repository can be accessed over HTTP as well.

## security
This should support HTTPS, and perhaps be limited to only HTTPS.

Authentication via http basic auth?
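
If http basic auth were chosen, a client would attach credentials roughly as sketched below. This is a minimal illustration using only Python's standard library; the helper name `basic_auth_header` and the credentials are placeholders, and nothing here is decided by this design.

```python
import base64

def basic_auth_header(username: str, password: str) -> dict:
    """Build the Authorization header for HTTP basic auth (RFC 7617)."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {token}"}

# Placeholder credentials, for illustration only.
print(basic_auth_header("Aladdin", "open sesame"))
```

Basic auth only base64-encodes the username and password, so anyone who can observe the request can recover them. That is one reason limiting the server to HTTPS would matter if this authentication method is used.
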