[[!toc ]]

## motivation

The [[P2P protocol]] is a custom protocol that git-annex speaks over an ssh
connection (mostly). This is a design for supporting the P2P
protocol over HTTP.

Upload of annex objects to git remotes that use http is currently not
supported by git-annex, and supporting it would be a generally very useful
addition.

For use cases such as OpenNeuro's javascript client, ssh is too difficult
to support, so they currently use a special remote that talks to an http
endpoint in order to upload objects. Implementing this would let them
talk to git-annex over http.

With the [[passthrough_proxy]], this would let clients configure a single
http remote that accesses a more complicated network of git-annex
repositories.

## integration with git

A webserver that is configured to serve a git repository either serves the
files in the repository with dumb http, or uses the git-http-backend CGI
program for url paths under eg `/git/`.

To integrate with that, git-annex would need a git-annex-http-backend CGI
program that the webserver is configured to run for url paths under
`/git/.*/annex/`.

So, for a remote with a url `http://example.com/git/foo`, git-annex would
use paths under `http://example.com/git/foo/annex/` to run its CGI.

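As a rough sketch of the path mapping just described (the function names are hypothetical, and only the `/git/.*/annex/` pattern and the example urls come from the text), the routing could look something like this in Python:

```python
import re

# Pattern from the text: url paths the webserver would hand to the
# proposed git-annex-http-backend CGI program.
ANNEX_CGI_PATHS = re.compile(r"^/git/.*/annex/")

def annex_base_url(remote_url: str) -> str:
    """Return the url prefix under which git-annex would run its CGI."""
    return remote_url.rstrip("/") + "/annex/"

def handled_by_annex_cgi(path: str) -> bool:
    """True if a webserver using the pattern above would run the annex CGI."""
    return ANNEX_CGI_PATHS.match(path) is not None

print(annex_base_url("http://example.com/git/foo"))
# "some/request" is a made-up path, only here to exercise the pattern.
print(handled_by_annex_cgi("/git/foo/annex/some/request"))
# Ordinary git paths stay with git-http-backend.
print(handled_by_annex_cgi("/git/foo/info/refs"))
```

The point is only that the annex paths nest inside the paths already served for the git repository, so the webserver has to try the more specific pattern first.
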
But the CGI interface is a poor match for the P2P protocol.

A particular problem is that `LOCKCONTENT` would need to be in one CGI
request, followed by another request to `UNLOCKCONTENT`. Unless
git-annex-http-backend forked a daemon to keep the content locked, it would
not be able to retain a file lock across the 2 requests. While the 10
minute retention lock would paper over that, `UNLOCKCONTENT` would not be
able to delete the retention lock, because there is no way to know if
another `LOCKCONTENT` was received later. So `LOCKCONTENT` would always lock
content for 10 minutes, which would result in some undesirable behaviors.

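To make the locking problem concrete, here is a toy model in Python. It is not how git-annex implements locking. The class and method names are invented for illustration, and the only thing taken from the text is the 10 minute retention period.

```python
RETENTION_SECONDS = 10 * 60  # the 10 minute retention lock from the text

class StatelessLocker:
    """Toy model of LOCKCONTENT/UNLOCKCONTENT handled by separate CGI runs."""

    def __init__(self):
        # key -> time at which the retention lock expires
        self.expiry = {}

    def lockcontent(self, key, now):
        # Each LOCKCONTENT request (re)starts the retention period.
        self.expiry[key] = now + RETENTION_SECONDS

    def unlockcontent(self, key, now):
        # Nothing links this request to a particular LOCKCONTENT, so it
        # cannot tell whether another client locked the key later.
        # The only safe choice is to leave the retention lock in place.
        pass

    def is_locked(self, key, now):
        return now < self.expiry.get(key, 0)

locker = StatelessLocker()
locker.lockcontent("SHA1--foo", now=0)
locker.unlockcontent("SHA1--foo", now=5)
# The key stays locked after the unlock request...
print(locker.is_locked("SHA1--foo", now=6))
# ...and only becomes unlocked once the retention period has run out.
print(locker.is_locked("SHA1--foo", now=RETENTION_SECONDS))
```

In this model the unlock request has no effect at all, which is the undesirable behavior: content stays undroppable for the full 10 minutes even when the client only needed the lock for a few seconds.
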
Another problem is with proxies and clusters. The CGI would need to open
ssh (or http) connections to the proxied repositories and cluster nodes
each time it is run. That would add a lot of latency to every request.

Running a git-annex process once per CGI request would also mean that
git-annex's own startup speed, which is ok but not great, adds latency
to every request. And each time the CGI needed to change the git-annex
branch, it would have to commit on shutdown. Lots of time and space
optimisations would be prevented by using the CGI interface.

So, rather than having the CGI program do anything in the repository
itself, have it pass each request through to a long-running server.
(This does have the downside that files would get double-copied
through the CGI, which adds some overhead.)
A reasonable way to do that would be to have a webserver speaking an
HTTP version of the git-annex P2P protocol, with the CGI just talking to that.

The CGI program then becomes tiny, and just needs to know the url to
connect to the git-annex HTTP server.

Alternatively, a remote's configuration could include that url, and
then we don't need the complication and overhead of the CGI program at all.
Eg:

    git config remote.origin.annex-url http://example.com:8080/

So, the rest of this design will focus on implementing that. The CGI
program can be added later if desired, to avoid users needing to configure
an additional thing.

Note that one nice benefit of having a separate annex-url is that it allows
having remote.origin.url on eg github, while with an annex-url configured,
that remote can also be used as a git-annex repository.

## approach 1: websockets

The client connects to the server over a websocket. From there on,
the protocol is encapsulated in websockets.

This seems nice and simple to implement, but not very web native. Anyone
wanting to talk to this web server would need to understand the P2P
protocol. Just uploading a file would mean dealing with AUTH,
AUTH-SUCCESS, AUTH-FAILURE, VERSION, PUT, ALREADY-HAVE, PUT-FROM, DATA,
INVALID, VALID, SUCCESS, and FAILURE messages. That seems like a lot.

Some requests like `LOCKCONTENT` do need full duplex communication like
websockets provide. But it might be more web native to only use websockets
for that request, and not for everything.

## approach 2: web-native API

Another approach is to define a web-native API with endpoints that
correspond to each action in the P2P protocol.

Something like this:

    > POST /git-annex/v1/AUTH?clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925 HTTP/1.0
    < AUTH-SUCCESS ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6

    > POST /git-annex/v1/CHECKPRESENT?key=SHA1--foo&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.0
    < SUCCESS

    > POST /git-annex/v1/PUT-FROM?key=SHA1--foo&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.0
    < PUT-FROM 0

    > POST /git-annex/v1/PUT?key=SHA1--foo&associatedfile=bar&put-from=0&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.0
    > Content-Type: application/octet-stream
    > Content-Length: 20
    > foo
    > {"valid": true}
    < {"stored": true}

(In the last example above, "foo" is the content; it is followed by a line of json.
This seems better than needing an entire other request to indicate validity.)

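A minimal sketch of how a client might frame such a PUT body, and how a server might split it again. The function names are hypothetical, and the exact framing is an assumption: here the content and the json line are each terminated by a newline, and the server treats the final line of the body as the json.

```python
import json

def put_body(content: bytes, valid: bool) -> bytes:
    """Frame object content followed by one line of json stating validity."""
    trailer = json.dumps({"valid": valid}).encode("ascii")
    # Assumption: a newline separates the content from the json line,
    # and another newline ends the json line.
    return content + b"\n" + trailer + b"\n"

def split_put_body(body: bytes):
    """Server side: peel the final json line off the end of the body."""
    if body.endswith(b"\n"):
        body = body[:-1]
    content, trailer = body.rsplit(b"\n", 1)
    return content, json.loads(trailer)["valid"]

body = put_body(b"foo", valid=True)
headers = {
    "Content-Type": "application/octet-stream",
    "Content-Length": str(len(body)),
}
print(body)
print(headers)
print(split_put_body(body))
```

Under this assumed framing the body for the example comes to 20 bytes, matching the `Content-Length` shown above. Splitting on the last newline means the content itself can contain newlines without confusing the server.
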
This needs a more complex spec. But it's easier for others to implement,
especially since it does not need a session identifier, so the HTTP server can
be stateless.

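To illustrate why no session identifier is needed, here is a sketch of a client building request paths in the style of the examples above. The helper name `request_path` is hypothetical, and the parameter names are taken from the examples rather than from a finished spec. Each request carries both UUIDs itself, so the server does not have to remember an earlier AUTH.

```python
from urllib.parse import urlencode

API_PREFIX = "/git-annex/v1/"  # prefix used in the examples in this section

def request_path(action, params, clientuuid, serveruuid=None):
    """Build the path and query string for one request.

    Every request repeats the UUIDs, so the server can stay stateless.
    """
    query = dict(params)
    query["clientuuid"] = clientuuid
    if serveruuid is not None:
        # AUTH is sent before the client has learned the server's UUID.
        query["serveruuid"] = serveruuid
    return API_PREFIX + action + "?" + urlencode(query)

CLIENT = "79a5a1f4-07e8-11ef-873d-97f93ca91925"
SERVER = "ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6"

print(request_path("AUTH", {}, CLIENT))
print(request_path("CHECKPRESENT", {"key": "SHA1--foo"}, CLIENT, SERVER))
print(request_path(
    "PUT",
    {"key": "SHA1--foo", "associatedfile": "bar", "put-from": "0"},
    CLIENT, SERVER,
))
```

`urlencode` also takes care of escaping keys and associated file names that contain characters not allowed in a query string.
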
A full draft protocol for this is being developed at [[p2p_protocol_over_http/draft1]].

## HTTP GET

It should be possible to support a regular HTTP get of a key, with
no additional parameters, so that annex objects can be served to other clients
from this web server.

    > GET /git-annex/key/SHA1--foo HTTP/1.0
    < foo

This would be a special case, not used by git-annex itself, because the P2P
protocol's GET has the complication of offsets, and of the server sending
VALID/INVALID after the content, and of needing to know the client's UUID in
order to update the location log.

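A sketch of what serving such a plain GET could look like, using Python's standard `http.server`. The in-memory `OBJECTS` table, the handler, and the port are all stand-ins invented for illustration; a real server would read from the annex. Only the `/git-annex/key/` path layout comes from the example above.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import unquote, urlsplit

KEY_PREFIX = "/git-annex/key/"  # path layout from the example above

# Hypothetical in-memory object store standing in for the annex.
OBJECTS = {"SHA1--foo": b"foo\n"}

def key_from_path(path):
    """Return the key named by a request path, or None if it is not a key url."""
    path = urlsplit(path).path  # drop any query string
    if not path.startswith(KEY_PREFIX):
        return None
    key = unquote(path[len(KEY_PREFIX):])
    # Reject empty keys and anything that looks like a sub-path.
    if not key or "/" in key:
        return None
    return key

class KeyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        key = key_from_path(self.path)
        content = OBJECTS.get(key) if key else None
        if content is None:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(content)))
        self.end_headers()
        self.wfile.write(content)

def serve(port=8080):
    """Run the sketch server until interrupted."""
    HTTPServer(("127.0.0.1", port), KeyHandler).serve_forever()
```

Note that nothing here knows the client's UUID, so this handler could not update the location log, which is one of the reasons given above for git-annex not using this endpoint itself.
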
## Problem: CONNECT

The CONNECT message allows both sides of the P2P protocol to send DATA
messages in any order. This seems difficult to encapsulate in HTTP.

This can probably be left unimplemented, since it is probably not needed for
an HTTP remote. It is used to tunnel git protocol over the P2P protocol, but for
an HTTP remote the git repository can be accessed over HTTP as well.

## security

Should support HTTPS and/or be limited to only HTTPS.

Authentication via http basic auth?