From 95ba4d4480d975c8c0df325ae0bfb22a25b60788 Mon Sep 17 00:00:00 2001
From: Joey Hess
Date: Fri, 5 Jul 2024 10:08:43 -0400
Subject: [PATCH] thoughts on CGI, and use json

---
 doc/design/p2p_protocol.mdwn                  |   2 +
 doc/design/p2p_protocol_over_http.mdwn        | 120 ++++++++++--------
 doc/design/p2p_protocol_over_http/draft1.mdwn | 111 ++++++++--------
 3 files changed, 121 insertions(+), 112 deletions(-)

diff --git a/doc/design/p2p_protocol.mdwn b/doc/design/p2p_protocol.mdwn
index 43542c06bb..c4f4aac27b 100644
--- a/doc/design/p2p_protocol.mdwn
+++ b/doc/design/p2p_protocol.mdwn
@@ -133,6 +133,8 @@ To remove a key's content from the server, the client sends:

 The server responds with either SUCCESS or FAILURE.

+Note that if the content was not present, SUCCESS will be returned.
+
 In protocol version 2, the server can optionally reply with SUCCESS-PLUS
 or FAILURE-PLUS. Each has a subsequent list of UUIDs of repositories
 that the content was removed from.
diff --git a/doc/design/p2p_protocol_over_http.mdwn b/doc/design/p2p_protocol_over_http.mdwn
index c402ca5d8a..6fe7ea4d4e 100644
--- a/doc/design/p2p_protocol_over_http.mdwn
+++ b/doc/design/p2p_protocol_over_http.mdwn
@@ -18,72 +18,80 @@ With the [[passthrough_proxy]], this would let clients configure a single
 http remote that accesses a more complicated network of git-annex
 repositories.

-## approach 1: encapsulation
+## integration with git

-One approach is to encapsulate the P2P protocol inside HTTP. This has the
-benefit of being simple to think about. It is not very web-native though.
+A webserver that is configured to serve a git repository either serves the
+files in the repository with dumb http, or uses the git-http-backend CGI
+program for url paths under eg `/git/`.

-There would be a single API endpoint. The client connects and sends a
-request that encapsulates one or more lines in the P2P protocol. The server
-sends a response that encapsulates one or more lines in the P2P
-protocol.
+To integrate with that, git-annex would need a git-annex-http-backend CGI
+program, that the webserver is configured to run for url paths under
+`/git/.*/annex/`.

-For example (eliding the full HTTP responses, only showing the data):
+So, for a remote with an url `http://example.com/git/foo`, git-annex would
+use paths under `http://example.com/git/foo/annex/` to run its CGI.

-    > POST /git-annex HTTP/1.0
-    > Content-Type: x-git-annex-p2p
-    > Content-Length: ...
-    >
-    > AUTH 79a5a1f4-07e8-11ef-873d-97f93ca91925
-    < AUTH-SUCCESS ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6
+But, the CGI interface is a poor match for the P2P protocol.

-    > POST /git-annex HTTP/1.0
-    > Content-Type: x-git-annex-p2p
-    > Content-Length: ...
-    >
-    > VERSION 1
-    < VERSION 1
+A particular problem is that `LOCKCONTENT` would need to be in one CGI
+request, followed by another request to `UNLOCKCONTENT`. Unless
+git-annex-http-backend forked a daemon to keep the content locked, it would
+not be able to retain a file lock across the 2 requests. While the 10
+minute retention lock would paper over that, UNLOCKCONTENT would not be
+able to delete the retention lock, because there is no way to know if
+another LOCKCONTENT was received later. So LOCKCONTENT would always lock
+content for 10 minutes. Which would result in some undesirable behaviors.

-    > POST /git-annex HTTP/1.0
-    > Content-Type: x-git-annex-p2p
-    > Content-Length: ...
-    >
-    > CHECKPRESENT SHA1--foo
-    < SUCCESS
+Another problem is with proxies and clusters. The CGI would need to open
+ssh (or http) connections to the proxied repositories and cluster nodes
+each time it is run. That would add a lot of latency to every request.

-    > POST /git-annex HTTP/1.0
-    > Content-Type: x-git-annex-p2p
-    > Content-Length: ...
-    >
-    > PUT bar SHA1--bar
-    < PUT-FROM 0
+And running a git-annex process once per CGI request also makes git-annex's
+own startup speed, which is ok but not great, add latency. And each time
+the CGI needed to change the git-annex branch, it would have to commit on
+shutdown. Lots of time and space optimisations would be prevented by using
+the CGI interface.

-    > POST /git-annex HTTP/1.0
-    > Content-Type: x-git-annex-p2p
-    > Content-Length: ...
-    >
-    > DATA 3
-    > foo
-    > VALID
-    < SUCCESS
+So, rather than having the CGI program do anything in the repository
+itself, have it pass each request through to a long-running server.
+(This does have the downside that files would get double-copied
+through the CGI, which adds some overhead.)
+A reasonable way to do that would be to have a webserver speaking a
+HTTP version of the git-annex P2P protocol and the CGI just talks to that.

-Note that, since VERSION is negotiated in one request, the HTTP server
-needs to know that a series of requests are part of the same P2P protocol
-session. In the example above, it would not have a good way to do that.
-One solution would be to add a session identifier UUID to each request.
+The CGI program then becomes tiny, and just needs to know the url to
+connect to the git-annex HTTP server.

-## approach 2: websockets
+Alternatively, a remote's configuration could include that url, and
+then we don't need the complication and overhead of the CGI program at all.
+Eg:
+
+    git config remote.origin.annex-url http://example.com:8080/
+
+So, the rest of this design will focus on implementing that. The CGI
+program can be added later if desired, to avoid users needing to configure
+an additional thing.
+
+Note that, one nice benefit of having a separate annex-url is it allows
+having remote.origin.url on eg github, but with an annex-url configured
+that remote can also be used as a git-annex repository.
+
+## approach 1: websockets

 The client connects to the server over a websocket. From there on,
 the protocol is encapsulated in websockets.

-This seems nice and simple, but again not very web native.
+This seems nice and simple to implement, but not very web native. Anyone
+wanting to talk to this web server would need to understand the P2P
+protocol. Just to upload a file would need to deal with AUTH,
+AUTH-SUCCESS, AUTH-FAILURE, VERSION, PUT, ALREADY-HAVE, PUT-FROM, DATA,
+INVALID, VALID, SUCCESS, and FAILURE messages. Seems like a lot.

-Some requests like `LOCKCONTENT` seem likely to need full duplex
-communication like websockets provide. But, it might be more web native to
-only use websockets for that request, and not for everything.
+Some requests like `LOCKCONTENT` do need full duplex communication like
+websockets provide. But, it might be more web native to only use websockets
+for that request, and not for everything.

-## approach 3: HTTP API
+## approach 2: web-native API

 Another approach is to define a web-native API with endpoints that
 correspond to each action in the P2P protocol.
@@ -101,13 +109,13 @@ Something like this:

     > POST /git-annex/v1/PUT?key=SHA1--foo&associatedfile=bar&put-from=0&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.0
     > Content-Type: application/octet-stream
-    > Content-Length: 4
-    > foo1
-    < SUCCESS
+    > Content-Length: 20
+    > foo
+    > {"valid": true}
+    < {"stored": true}

-(In the last example above "foo" is the content, there is an additional byte at the end that
-is 1 for VALID and 0 for INVALID. This seems better than needing an entire
-other request to indicate validitity.)
+(In the last example above "foo" is the content, it is followed by a line of json.
+This seems better than needing an entire other request to indicate validity.)

 This needs a more complex spec. But it's easier for others to implement,
 especially since it does not need a session identifier, so the HTTP server can
diff --git a/doc/design/p2p_protocol_over_http/draft1.mdwn b/doc/design/p2p_protocol_over_http/draft1.mdwn
index 20ec846514..071321537b 100644
--- a/doc/design/p2p_protocol_over_http/draft1.mdwn
+++ b/doc/design/p2p_protocol_over_http/draft1.mdwn
@@ -73,14 +73,14 @@ Checks if a key is currently present on the server.
 Example:

     > POST /git-annex/v3/checkpresent?key=SHA1--foo&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.1
-    < SUCCESS
+    < {"present": true}

 There is one required additional parameter, `key`.

 The body of the request is empty.

-The server responds with "SUCCESS" if the key is present
-or "FAILURE" if it is not present.
+The server responds with a JSON object with a "present" field that is true
+if the key is present, or false if it is not present.

 ### lockcontent

@@ -106,24 +106,22 @@ Remove a key's content from the server.
 Example:

     > POST /git-annex/v3/remove?key=SHA1--foo&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.1
-    < SUCCESS
+    < {"removed": true}

 There is one required additional parameter, `key`.

 The body of the request is empty.

-The server responds with "SUCCESS" if the key was removed,
-or "FAILURE" if the key was not able to be removed.
+The server responds with a JSON object with a "removed" field that is true
+if the key was removed (or was not present on the server),
+or false if the key was not able to be removed.

-The server can also respond with "SUCCESS-PLUS" or "FAILURE-PLUS".
-Each has a subsequent list of UUIDs of repositories
-that the content was removed from. For example:
+The JSON object can have an additional field "plusuuids" that is a list of
+UUIDs of other repositories that the content was removed from.

-    SUCCESS-PLUS 702ce472-38a1-11ef-864f-23851a2edf71 707dea20-38a1-11ef-96a4-fb7e8c8369f0
-
-If the server was prevented from trying to remove the key due to a policy
-(eg due to being read-only or append-only, it will respond with "ERROR",
-followed by a space and an error message.
+If the server does not allow removing the key due to a policy
+(eg due to being read-only or append-only), it will respond with a JSON
+object with an "error" field that has an error message as its value.

 ## remove-before

@@ -132,13 +130,15 @@ Remove a key's content from the server, but only before a specified time.
 Example:

     > POST /git-annex/v3/remove-before?timestamp=4949292929&key=SHA1--foo&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.1
-    < SUCCESS
+    < {"removed": true}

 This is the same as the `remove` request, but with an additional parameter,
 `timestamp`.

 If the server's monotonic clock is past the specified timestamp, the
-removal will fail. This is used to avoid removing content after a point in
+removal will fail and the server will respond with: `{"removed": false}`
+
+This is used to avoid removing content after a point in
 time where it is no longer locked in other repostitories.

 ## gettimestamp
@@ -148,12 +148,12 @@ Gets the current timestamp from the server.
 Example:

     > POST /git-annex/v3/gettimestamp?clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.1
-    < TIMESTAMP 59459392
+    < {"timestamp": 59459392}

 The body of the request is empty.

-The server responds with "TIMESTAMP" followed by a space and the current
-value of its monotonic clock, as a number of seconds.
+The server responds with a JSON object with a timestamp field that has the
+current value of its monotonic clock, as a number of seconds.

 Important: If multiple servers are serving this protocol for the same
 repository, they MUST all use the same monotonic clock.
@@ -166,13 +166,14 @@ Example:

     > POST /git-annex/v3/put?key=SHA1--foo&associatedfile=bar&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.1
     > Content-Type: application/octet-stream
-    > Content-Length: 4
-    > foo1
-    < SUCCESS
+    > Content-Length: 20
+    > foo
+    > {"valid": true}
+    < {"stored": true}

 There is one required additional parameter, `key`.

-There is are also these optional parameters:
+There are also these optional parameters:

 * `associatedfile`

@@ -186,28 +187,27 @@ There is are also these optional parameters:

 The body of the request is the content of the key, starting from the
 specified offset or from the beginning. After the content of the key,
-there is one more byte.
+there is a newline, followed by a JSON object.

-The additional byte is "1" to indicate that the content was not changed
-while it was being sent, or "0" to indicate that modified content was sent
-and should be disregarded by the server. (This corresponds
+The JSON object has a field "valid" that is true when the content
+was not changed while it was being sent, or false when modified
+content was sent and should be disregarded by the server. (This corresponds
 to the `VALID` and `INVALID` messages in the P2P protocol.)

 The `Content-Type` header should be `application/octet-stream`.

 The `Content-Length` header should be set to the length of the body.

-The server responds with `SUCCESS` if it received the data and stored the
-content. If it was unable to do so, it responds with `FAILURE`.
+The server responds with a JSON object with a field "stored"
+that is true if it received the data and stored the
+content.

-The server can also reply with `SUCCESS-PLUS`, which has a subsequent list of
-UUIDs of repositories that the content was stored to. For example:
+The JSON object can have an additional field "plusuuids" that is a list of
+UUIDs of other repositories that the content was stored to.

-    SUCCESS-PLUS 702ce472-38a1-11ef-864f-23851a2edf71 707dea20-38a1-11ef-96a4-fb7e8c8369f0
-
-If the server was prevented from storing the key due to a policy
-(eg due to being read-only), it will respond with "ERROR", followed
-by a space and an error message.
+If the server does not allow storing the key due to a policy
+(eg due to being read-only or append-only), it will respond with a JSON
+object with an "error" field that has an error message as its value.

 ### putoffset

@@ -220,17 +220,18 @@ the `put` request failing.
 Example:

     > POST /git-annex/v3/putoffset?key=SHA1--foo&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.1
-    < 10
+    < {"offset": 10}

 There is one required additional parameter, `key`.

 The body of the request is empty.

-The server responds with the largest allowable offset.
+The server responds with a JSON object with an "offset" field that
+is the largest allowable offset.

-If the server was prevented from storing the key due to a policy
-(eg due to being read-only), it will respond with "ERROR", followed
-by a space and an error message.
+If the server does not allow storing the key due to a policy
+(eg due to being read-only or append-only), it will respond with a JSON
+object with an "error" field that has an error message as its value.

 [Implementation note: This will be implemented by sending `PUT` and
 returning the `PUT-FROM` offset. To avoid leaving the P2P protocol stuck
@@ -246,8 +247,9 @@ Example:

     > POST /git-annex/v3/get?key=SHA1--foo&associatedfile=bar&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.1
     < Content-Type: application/octet-stream
-    < Content-Length: 4
-    < foo1
+    < Content-Length: 20
+    < foo
+    < {"valid": true}

 There is one required additional parameter, `key`.

@@ -271,17 +273,14 @@ The server's response will have a `Content-Length` header set to the
 length of the body.

 The server's response body is the content of the key, from the specified
-offset. After the content of the key, there is one more byte.
+offset. After the content of the key, there is a newline, followed by a
+JSON object.

-The additional byte is "1" to indicate that the content was not changed
-while it was being sent, or "0" to indicate that modified content was sent
-and should be discarded by the client. (This corresponds
-to the `VALID` and `INVALID` messages in the P2P protocol.)
-
-Note that, if the server is not able to send the content of the requested
-key, its response body will consist of "0", eg 0 bytes of content which is
-not valid. On the other hand, a response body of "1" is used for an empty
-key which is valid.
+The JSON object has a field "valid" that is true when the content
+was not changed while it was being sent, or false when whatever
+content was sent is not the actual content of the key and should be
+disregarded. (This corresponds to the `VALID` and `INVALID` messages
+in the P2P protocol.)

 ## simple HTTP GET

@@ -301,6 +300,6 @@ the content of a key.
 this HTTP protocol to support it.

 `CONNECT` is not supported, and due to the bi-directional message passing
-nature of it, it cannot easily be done over HTTP. It should not be
-necessary anyway, because the git repository itself can be accessed over
-HTTP.
+nature of it, it cannot easily be done over HTTP (would need websockets).
+It should not be necessary anyway, because the git repository itself can be
+accessed over HTTP.
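
Reviewer note: the `put` and `get` bodies in draft1.mdwn are now framed as the key's content, then a newline, then a JSON object with a "valid" field. The following is a minimal Python sketch of that framing, not the git-annex implementation. The function names `build_body` and `split_body` are hypothetical. The trailing newline after the JSON object is an assumption inferred from the `Content-Length: 20` shown in the examples; the draft does not state it.

```python
import json
from typing import Tuple


def build_body(content: bytes, valid: bool = True) -> bytes:
    """Frame key content as: content, newline, one line of JSON."""
    trailer = json.dumps({"valid": valid}).encode("ascii")
    # Assumption: the JSON line is itself newline-terminated.
    return content + b"\n" + trailer + b"\n"


def split_body(body: bytes) -> Tuple[bytes, bool]:
    """Split a framed body back into (content, valid)."""
    # Drop the assumed single trailing newline after the JSON object.
    if body.endswith(b"\n"):
        body = body[:-1]
    # Key content is arbitrary binary data and may contain newlines,
    # so split on the last newline rather than the first.
    content, sep, trailer = body.rpartition(b"\n")
    if not sep:
        raise ValueError("no newline before the JSON object")
    return content, json.loads(trailer)["valid"] is True


if __name__ == "__main__":
    body = build_body(b"foo")
    print(len(body), split_body(body))
```

Under these assumptions, a receiver would discard the content whenever the second element of the returned pair is false, mirroring the `INVALID` message of the P2P protocol.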