thoughts on CGI, and use json

Joey Hess 2024-07-05 10:08:43 -04:00
parent 3f9569e27f
commit 95ba4d4480
GPG key ID: DB12DB0FF05F8F38
3 changed files with 121 additions and 112 deletions

@ -133,6 +133,8 @@ To remove a key's content from the server, the client sends:
The server responds with either SUCCESS or FAILURE.
Note that if the content was not present, SUCCESS will be returned.
In protocol version 2, the server can optionally reply with SUCCESS-PLUS
or FAILURE-PLUS. Each has a subsequent list of UUIDs of repositories
that the content was removed from.

@ -18,72 +18,80 @@ With the [[passthrough_proxy]], this would let clients configure a single
http remote that accesses a more complicated network of git-annex
repositories.
## approach 1: encapsulation
## integration with git
One approach is to encapsulate the P2P protocol inside HTTP. This has the
benefit of being simple to think about. It is not very web-native though.
A webserver that is configured to serve a git repository either serves the
files in the repository with dumb http, or uses the git-http-backend CGI
program for url paths under eg `/git/`.
There would be a single API endpoint. The client connects and sends a
request that encapsulates one or more lines in the P2P protocol. The server
sends a response that encapsulates one or more lines in the P2P
protocol.
To integrate with that, git-annex would need a git-annex-http-backend CGI
program, that the webserver is configured to run for url paths under
`/git/.*/annex/`.
For example (eliding the full HTTP responses, only showing the data):
So, for a remote with an url `http://example.com/git/foo`, git-annex would
use paths under `http://example.com/git/foo/annex/` to run its CGI.
> POST /git-annex HTTP/1.0
> Content-Type: x-git-annex-p2p
> Content-Length: ...
>
> AUTH 79a5a1f4-07e8-11ef-873d-97f93ca91925
< AUTH-SUCCESS ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6
But, the CGI interface is a poor match for the P2P protocol.
> POST /git-annex HTTP/1.0
> Content-Type: x-git-annex-p2p
> Content-Length: ...
>
> VERSION 1
< VERSION 1
A particular problem is that `LOCKCONTENT` would need to be in one CGI
request, followed by another request to `UNLOCKCONTENT`. Unless
git-annex-http-backend forked a daemon to keep the content locked, it would
not be able to retain a file lock across the 2 requests. While the 10
minute retention lock would paper over that, UNLOCKCONTENT would not be
able to delete the retention lock, because there is no way to know if
another LOCKCONTENT was received later. So LOCKCONTENT would always lock
content for 10 minutes. Which would result in some undesirable behaviors.
> POST /git-annex HTTP/1.0
> Content-Type: x-git-annex-p2p
> Content-Length: ...
>
> CHECKPRESENT SHA1--foo
< SUCCESS
Another problem is with proxies and clusters. The CGI would need to open
ssh (or http) connections to the proxied repositories and cluster nodes
each time it is run. That would add a lot of latency to every request.
> POST /git-annex HTTP/1.0
> Content-Type: x-git-annex-p2p
> Content-Length: ...
>
> PUT bar SHA1--bar
< PUT-FROM 0
And running a git-annex process once per CGI request also makes git-annex's
own startup speed, which is ok but not great, add latency. And each time
the CGI needed to change the git-annex branch, it would have to commit on
shutdown. Lots of time and space optimisations would be prevented by using
the CGI interface.
> POST /git-annex HTTP/1.0
> Content-Type: x-git-annex-p2p
> Content-Length: ...
>
> DATA 3
> foo
> VALID
< SUCCESS
So, rather than having the CGI program do anything in the repository
itself, have it pass each request through to a long-running server.
(This does have the downside that files would get double-copied
through the CGI, which adds some overhead.)
A reasonable way to do that would be to have a webserver speaking a
HTTP version of the git-annex P2P protocol and the CGI just talks to that.
Note that, since VERSION is negotiated in one request, the HTTP server
needs to know that a series of requests are part of the same P2P protocol
session. In the example above, it would not have a good way to do that.
One solution would be to add a session identifier UUID to each request.
The CGI program then becomes tiny, and just needs to know the url to
connect to the git-annex HTTP server.
## approach 2: websockets
Alternatively, a remote's configuration could include that url, and
then we don't need the complication and overhead of the CGI program at all.
Eg:
git config remote.origin.annex-url http://example.com:8080/
So, the rest of this design will focus on implementing that. The CGI
program can be added later if desired, to avoid users needing to configure
an additional thing.
Note that, one nice benefit of having a separate annex-url is it allows
having remote.origin.url on eg github, but with an annex-url configured
that remote can also be used as a git-annex repository.
## approach 1: websockets
The client connects to the server over a websocket. From there on,
the protocol is encapsulated in websockets.
This seems nice and simple, but again not very web native.
This seems nice and simple to implement, but not very web native. Anyone
wanting to talk to this web server would need to understand the P2P
protocol. Just to upload a file would need to deal with AUTH,
AUTH-SUCCESS, AUTH-FAILURE, VERSION, PUT, ALREADY-HAVE, PUT-FROM, DATA,
INVALID, VALID, SUCCESS, and FAILURE messages. Seems like a lot.
Some requests like `LOCKCONTENT` seem likely to need full duplex
communication like websockets provide. But, it might be more web native to
only use websockets for that request, and not for everything.
Some requests like `LOCKCONTENT` do need full duplex communication like
websockets provide. But, it might be more web native to only use websockets
for that request, and not for everything.
## approach 3: HTTP API
## approach 2: web-native API
Another approach is to define a web-native API with endpoints that
correspond to each action in the P2P protocol.
@ -101,13 +109,13 @@ Something like this:
> POST /git-annex/v1/PUT?key=SHA1--foo&associatedfile=bar&put-from=0&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.0
> Content-Type: application/octet-stream
> Content-Length: 4
> foo1
< SUCCESS
> Content-Length: 20
> foo
> {"valid": true}
< {"stored": true}
(In the last example above "foo" is the content, there is an additional byte at the end that
is 1 for VALID and 0 for INVALID. This seems better than needing an entire
other request to indicate validitity.)
(In the last example above, "foo" is the content; it is followed by a line of JSON.
This seems better than needing an entire other request to indicate validity.)
This needs a more complex spec. But it's easier for others to implement,
especially since it does not need a session identifier, so the HTTP server can

@ -73,14 +73,14 @@ Checks if a key is currently present on the server.
Example:
> POST /git-annex/v3/checkpresent?key=SHA1--foo&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.1
< SUCCESS
< {"present": true}
There is one required additional parameter, `key`.
The body of the request is empty.
The server responds with "SUCCESS" if the key is present
or "FAILURE" if it is not present.
The server responds with a JSON object with a "present" field that is true
if the key is present, or false if it is not present.
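A minimal client-side sketch of the JSON form of `checkpresent`, in Python. The helper names and the `base_url` argument are hypothetical; only the path, the query parameters, and the "present" field come from the text above.

```python
import json
from urllib.parse import urlencode


def checkpresent_url(base_url, key, client_uuid, server_uuid):
    """Build the checkpresent URL (hypothetical helper, not part of git-annex)."""
    query = urlencode(
        {"key": key, "clientuuid": client_uuid, "serveruuid": server_uuid}
    )
    return f"{base_url}/git-annex/v3/checkpresent?{query}"


def parse_present(response_body):
    """Return True when the server's JSON reply says the key is present."""
    reply = json.loads(response_body)
    return reply["present"] is True
```

The request itself would be a POST with an empty body to the returned URL.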
### lockcontent
@ -106,24 +106,22 @@ Remove a key's content from the server.
Example:
> POST /git-annex/v3/remove?key=SHA1--foo&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.1
< SUCCESS
< {"removed": true}
There is one required additional parameter, `key`.
The body of the request is empty.
The server responds with "SUCCESS" if the key was removed,
or "FAILURE" if the key was not able to be removed.
The server responds with a JSON object with a "removed" field that is true
if the key was removed (or was not present on the server),
or false if the key was not able to be removed.
The server can also respond with "SUCCESS-PLUS" or "FAILURE-PLUS".
Each has a subsequent list of UUIDs of repositories
that the content was removed from. For example:
The JSON object can have an additional field "plusuuids" that is a list of
UUIDs of other repositories that the content was removed from.
SUCCESS-PLUS 702ce472-38a1-11ef-864f-23851a2edf71 707dea20-38a1-11ef-96a4-fb7e8c8369f0
If the server was prevented from trying to remove the key due to a policy
(eg due to being read-only or append-only, it will respond with "ERROR",
followed by a space and an error message.
If the server does not allow removing the key due to a policy
(eg due to being read-only or append-only), it will respond with a JSON
object with an "error" field that has an error message as its value.
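One way a client might interpret the JSON reply to `remove`, sketched in Python. The `RemoveResult` and `ServerPolicyError` names are made up for illustration; the "removed", "plusuuids" and "error" fields are the ones described above.

```python
import json
from dataclasses import dataclass, field


class ServerPolicyError(Exception):
    """Hypothetical exception for a reply carrying an "error" field."""


@dataclass
class RemoveResult:
    removed: bool
    plusuuids: list = field(default_factory=list)


def parse_remove_response(body):
    """Decode the JSON reply to a remove request."""
    reply = json.loads(body)
    if "error" in reply:
        # The server refused to try, eg because it is read-only.
        raise ServerPolicyError(reply["error"])
    return RemoveResult(
        removed=reply["removed"] is True,
        # "plusuuids" is optional, so default to an empty list.
        plusuuids=list(reply.get("plusuuids", [])),
    )
```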
## remove-before
@ -132,13 +130,15 @@ Remove a key's content from the server, but only before a specified time.
Example:
> POST /git-annex/v3/remove-before?timestamp=4949292929&key=SHA1--foo&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.1
< SUCCESS
< {"removed": true}
This is the same as the `remove` request, but with an additional parameter,
`timestamp`.
If the server's monotonic clock is past the specified timestamp, the
removal will fail. This is used to avoid removing content after a point in
removal will fail and the server will respond with: `{"removed": false}`
This is used to avoid removing content after a point in
time where it is no longer locked in other repositories.
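The server-side check could look roughly like this Python sketch. The function and its `remove_key` and `now` arguments are hypothetical; `now` is injectable because several servers for the same repository would need to share one clock, which `time.monotonic` alone does not provide.

```python
import time


def remove_before(timestamp, remove_key, now=time.monotonic):
    """Attempt a removal only while the clock is not past `timestamp`.

    `remove_key` is a hypothetical callable that performs the removal
    and returns True on success.
    """
    if now() > timestamp:
        # Too late: the content may no longer be locked elsewhere.
        return {"removed": False}
    return {"removed": bool(remove_key())}
```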
## gettimestamp
@ -148,12 +148,12 @@ Gets the current timestamp from the server.
Example:
> POST /git-annex/v3/gettimestamp?clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.1
< TIMESTAMP 59459392
< {"timestamp": 59459392}
The body of the request is empty.
The server responds with "TIMESTAMP" followed by a space and the current
value of its monotonic clock, as a number of seconds.
The server responds with a JSON object with a "timestamp" field that has the
current value of its monotonic clock, as a number of seconds.
Important: If multiple servers are serving this protocol for the same
repository, they MUST all use the same monotonic clock.
@ -166,13 +166,14 @@ Example:
> POST /git-annex/v3/put?key=SHA1--foo&associatedfile=bar&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.1
> Content-Type: application/octet-stream
> Content-Length: 4
> foo1
< SUCCESS
> Content-Length: 20
> foo
> {"valid": true}
< {"stored": true}
There is one required additional parameter, `key`.
There is are also these optional parameters:
There are also these optional parameters:
* `associatedfile`
@ -186,28 +187,27 @@ There is are also these optional parameters:
The body of the request is the content of the key, starting from the
specified offset or from the beginning. After the content of the key,
there is one more byte.
there is a newline, followed by a JSON object.
The additional byte is "1" to indicate that the content was not changed
while it was being sent, or "0" to indicate that modified content was sent
and should be disregarded by the server. (This corresponds
The JSON object has a field "valid" that is true when the content
was not changed while it was being sent, or false when modified
content was sent and should be disregarded by the server. (This corresponds
to the `VALID` and `INVALID` messages in the P2P protocol.)
The `Content-Type` header should be `application/octet-stream`.
The `Content-Length` header should be set to the length of the body.
The server responds with `SUCCESS` if it received the data and stored the
content. If it was unable to do so, it responds with `FAILURE`.
The server responds with a JSON object with a field "stored"
that is true if it received the data and stored the
content.
The server can also reply with `SUCCESS-PLUS`, which has a subsequent list of
UUIDs of repositories that the content was stored to. For example:
The JSON object can have an additional field "plusuuids" that is a list of
UUIDs of other repositories that the content was stored to.
SUCCESS-PLUS 702ce472-38a1-11ef-864f-23851a2edf71 707dea20-38a1-11ef-96a4-fb7e8c8369f0
If the server was prevented from storing the key due to a policy
(eg due to being read-only), it will respond with "ERROR", followed
by a space and an error message.
If the server does not allow storing the key due to a policy
(eg due to being read-only or append-only), it will respond with a JSON
object with an "error" field that has an error message as its value.
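A sketch, in Python, of how a client might assemble the `put` request body described above: the content, a newline, then the JSON object. The helper names are hypothetical. The text does not say whether anything follows the JSON object, so this sketch emits nothing after it.

```python
import json


def build_put_body(content, valid=True, offset=0):
    """Return the request body: content from `offset`, newline, JSON trailer."""
    trailer = json.dumps({"valid": valid}).encode("utf-8")
    return content[offset:] + b"\n" + trailer


def put_headers(body):
    """Headers the text above asks for; Content-Length covers the whole body."""
    return {
        "Content-Type": "application/octet-stream",
        "Content-Length": str(len(body)),
    }
```

Sending `valid=False` tells the server to disregard the content, matching `INVALID` in the P2P protocol.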
### putoffset
@ -220,17 +220,18 @@ the `put` request failing.
Example:
> POST /git-annex/v3/putoffset?key=SHA1--foo&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.1
< 10
< {"offset": 10}
There is one required additional parameter, `key`.
The body of the request is empty.
The server responds with the largest allowable offset.
The server responds with a JSON object with an "offset" field that
is the largest allowable offset.
If the server was prevented from storing the key due to a policy
(eg due to being read-only), it will respond with "ERROR", followed
by a space and an error message.
If the server does not allow storing the key due to a policy
(eg due to being read-only or append-only), it will respond with a JSON
object with an "error" field that has an error message as its value.
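A small Python sketch of reading the `putoffset` reply. Both function names are hypothetical, and raising `PermissionError` for an "error" reply is just one possible choice.

```python
import json


def parse_putoffset(body):
    """Return the largest allowable offset from the server's JSON reply."""
    reply = json.loads(body)
    if "error" in reply:
        # The server does not allow storing the key at all.
        raise PermissionError(reply["error"])
    return int(reply["offset"])


def choose_offset(largest_allowed, local_size):
    """Pick a resume offset no larger than the server allows or the data has."""
    return max(0, min(largest_allowed, local_size))
```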
[Implementation note: This will be implemented by sending `PUT` and
returning the `PUT-FROM` offset. To avoid leaving the P2P protocol stuck
@ -246,8 +247,9 @@ Example:
> POST /git-annex/v3/get?key=SHA1--foo&associatedfile=bar&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.1
< Content-Type: application/octet-stream
< Content-Length: 4
< foo1
< Content-Length: 20
< foo
< {"valid": true}
There is one required additional parameter, `key`.
@ -271,17 +273,14 @@ The server's response will have a `Content-Length` header
set to the length of the body.
The server's response body is the content of the key, from the specified
offset. After the content of the key, there is one more byte.
offset. After the content of the key, there is a newline, followed by a
JSON object.
The additional byte is "1" to indicate that the content was not changed
while it was being sent, or "0" to indicate that modified content was sent
and should be discarded by the client. (This corresponds
to the `VALID` and `INVALID` messages in the P2P protocol.)
Note that, if the server is not able to send the content of the requested
key, its response body will consist of "0", eg 0 bytes of content which is
not valid. On the other hand, a response body of "1" is used for an empty
key which is valid.
The JSON object has a field "valid" that is true when the content
was not changed while it was being sent, or false when whatever
content was sent is not the actual content of the key and should be
disregarded. (This corresponds to the `VALID` and `INVALID` messages
in the P2P protocol.)
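A sketch of how a client might split the `get` response body into content and the trailing JSON object, in Python. The function name is hypothetical. It assumes the JSON object is serialized on a single line, and it tolerates an optional trailing newline after it; neither detail is stated above.

```python
import json


def split_get_body(body):
    """Split a get response body into (content, valid)."""
    # The key's content may itself contain newlines, so split on the
    # last newline, which separates the content from the JSON trailer.
    content, sep, trailer = body.rstrip(b"\n").rpartition(b"\n")
    if not sep:
        raise ValueError("missing newline before JSON trailer")
    valid = json.loads(trailer).get("valid") is True
    return content, valid
```

When `valid` is false, the returned content should be thrown away rather than stored.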
## simple HTTP GET
@ -301,6 +300,6 @@ the content of a key.
this HTTP protocol to support it.
`CONNECT` is not supported, and due to the bi-directional message passing
nature of it, it cannot easily be done over HTTP. It should not be
necessary anyway, because the git repository itself can be accessed over
HTTP.
nature of it, it cannot easily be done over HTTP (would need websockets).
It should not be necessary anyway, because the git repository itself can be
accessed over HTTP.