thoughts on CGI, and use json

2024-07-05 10:08:43 -04:00 · 2024-07-05 10:08:43 -04:00 · 95ba4d4480
commit 95ba4d4480
parent 3f9569e27f
3 changed files with 121 additions and 112 deletions
--- a/doc/design/p2p_protocol.mdwn
+++ b/doc/design/p2p_protocol.mdwn
@@ -133,6 +133,8 @@ To remove a key's content from the server, the client sends:

 The server responds with either SUCCESS or FAILURE.

+Note that if the content was not present, SUCCESS will be returned.
+
 In protocol version 2, the server can optionally reply with SUCCESS-PLUS
 or FAILURE-PLUS. Each has a subsequent list of UUIDs of repositories
 that the content was removed from.
--- a/doc/design/p2p_protocol_over_http.mdwn
+++ b/doc/design/p2p_protocol_over_http.mdwn
@@ -18,72 +18,80 @@ With the [[passthrough_proxy]], this would let clients configure a single
 http remote that accesses a more complicated network of git-annex
 repositories.

-## approach 1: encapsulation
+## integration with git

-One approach is to encapsulate the P2P protocol inside HTTP. This has the
-benefit of being simple to think about. It is not very web-native though.
+A webserver that is configured to serve a git repository either serves the
+files in the repository with dumb http, or uses the git-http-backend CGI
+program for url paths under eg `/git/`.

-There would be a single API endpoint. The client connects and sends a
-request that encapsulates one or more lines in the P2P protocol. The server
-sends a response that encapsulates one or more lines in the P2P
-protocol.
+To integrate with that, git-annex would need a git-annex-http-backend CGI
+program, that the webserver is configured to run for url paths under
+`/git/.*/annex/`.

-For example (eliding the full HTTP responses, only showing the data):
+So, for a remote with an url `http://example.com/git/foo`, git-annex would
+use paths under `http://example.com/git/foo/annex/` to run its CGI.

-    > POST /git-annex HTTP/1.0
-    > Content-Type: x-git-annex-p2p
-    > Content-Length: ...
-    > 
-    > AUTH 79a5a1f4-07e8-11ef-873d-97f93ca91925 
-    < AUTH-SUCCESS ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6
+But, the CGI interface is a poor match for the P2P protocol. 

-    > POST /git-annex HTTP/1.0
-    > Content-Type: x-git-annex-p2p
-    > Content-Length: ...
-    > 
-    > VERSION 1
-    < VERSION 1
+A particular problem is that `LOCKCONTENT` would need to be in one CGI
+request, followed by another request to `UNLOCKCONTENT`. Unless
+git-annex-http-backend forked a daemon to keep the content locked, it would
+not be able to retain a file lock across the 2 requests. While the 10
+minute retention lock would paper over that, UNLOCKCONTENT would not be
+able to delete the retention lock, because there is no way to know if
+another LOCKCONTENT was received later. So LOCKCONTENT would always lock
+content for 10 minutes. Which would result in some undesirable behaviors.

-    > POST /git-annex HTTP/1.0
-    > Content-Type: x-git-annex-p2p
-    > Content-Length: ...
-    > 
-    > CHECKPRESENT SHA1--foo
-    < SUCCESS
+Another problem is with proxies and clusters. The CGI would need to open
+ssh (or http) connections to the proxied repositories and cluster nodes
+each time it is run. That would add a lot of latency to every request.

-    > POST /git-annex HTTP/1.0
-    > Content-Type: x-git-annex-p2p
-    > Content-Length: ...
-    > 
-    > PUT bar SHA1--bar
-    < PUT-FROM 0
+And running a git-annex process once per CGI request also makes git-annex's
+own startup speed, which is ok but not great, add latency. And each time
+the CGI needed to change the git-annex branch, it would have to commit on
+shutdown. Lots of time and space optimisations would be prevented by using
+the CGI interface.

-    > POST /git-annex HTTP/1.0
-    > Content-Type: x-git-annex-p2p
-    > Content-Length: ...
-    > 
-    > DATA 3
-    > foo
-    > VALID
-    < SUCCESS
+So, rather than having the CGI program do anything in the repository
+itself, have it pass each request through to a long-running server.
+(This does have the downside that files would get double-copied
+through the CGI, which adds some overhead.)
+A reasonable way to do that would be to have a webserver speaking an
+HTTP version of the git-annex P2P protocol, with the CGI just talking to that.

-Note that, since VERSION is negotiated in one request, the HTTP server
-needs to know that a series of requests are part of the same P2P protocol
-session. In the example above, it would not have a good way to do that.
-One solution would be to add a session identifier UUID to each request.
+The CGI program then becomes tiny, and just needs to know the url to
+connect to the git-annex HTTP server.

-## approach 2: websockets
+Alternatively, a remote's configuration could include that url, and
+then we don't need the complication and overhead of the CGI program at all.
+Eg:
+
+    git config remote.origin.annex-url http://example.com:8080/
+
+So, the rest of this design will focus on implementing that. The CGI
+program can be added later if desired, to avoid users needing to configure
+an additional thing.
+
+Note that, one nice benefit of having a separate annex-url is it allows
+having remote.origin.url on eg github, but with an annex-url configured
+that remote can also be used as a git-annex repository.
+
+## approach 1: websockets

 The client connects to the server over a websocket. From there on,
 the protocol is encapsulated in websockets.

-This seems nice and simple, but again not very web native. 
+This seems nice and simple to implement, but not very web native. Anyone
+wanting to talk to this web server would need to understand the P2P
+protocol. Just to upload a file, a client would need to deal with AUTH,
+AUTH-SUCCESS, AUTH-FAILURE, VERSION, PUT, ALREADY-HAVE, PUT-FROM, DATA,
+INVALID, VALID, SUCCESS, and FAILURE messages. Seems like a lot.

-Some requests like `LOCKCONTENT` seem likely to need full duplex
-communication like websockets provide. But, it might be more web native to
-only use websockets for that request, and not for everything.
+Some requests like `LOCKCONTENT` do need full duplex communication like
+websockets provide. But, it might be more web native to only use websockets
+for that request, and not for everything.

-## approach 3: HTTP API
+## approach 2: web-native API

 Another approach is to define a web-native API with endpoints that
 correspond to each action in the P2P protocol. 
@@ -101,13 +109,13 @@ Something like this:

    > POST /git-annex/v1/PUT?key=SHA1--foo&associatedfile=bar&put-from=0&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.0
    > Content-Type: application/octet-stream
-    > Content-Length: 4
-    > foo1
-    < SUCCESS
+    > Content-Length: 20
+    > foo
+    > {"valid": true}
+    < {"stored": true}

-(In the last example above "foo" is the content, there is an additional byte at the end that
-is 1 for VALID and 0 for INVALID. This seems better than needing an entire
-other request to indicate validitity.)
+(In the last example above, "foo" is the content, and it is followed by a line of JSON.
+This seems better than needing an entire other request to indicate validity.)

 This needs a more complex spec. But it's easier for others to implement,
 especially since it does not need a session identifier, so the HTTP server can 
--- a/doc/design/p2p_protocol_over_http/draft1.mdwn
+++ b/doc/design/p2p_protocol_over_http/draft1.mdwn
@@ -73,14 +73,14 @@ Checks if a key is currently present on the server.
 Example:

    > POST /git-annex/v3/checkpresent?key=SHA1--foo&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.1
-    < SUCCESS
+    < {"present": true}

 There is one required additional parameter, `key`.

 The body of the request is empty.

-The server responds with "SUCCESS" if the key is present
-or "FAILURE" if it is not present.
+The server responds with a JSON object with a "present" field that is true
+if the key is present, or false if it is not present.
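As a rough illustration of the `checkpresent` request and response described above, here is a minimal Python sketch. The function names and the `base` server root are hypothetical and not part of the draft; only the endpoint path, the query parameters, and the "present" field come from the text.

```python
import json
from urllib.parse import urlencode


def checkpresent_url(base, key, clientuuid, serveruuid):
    """Build a checkpresent URL. `base` is an assumed server root URL."""
    query = urlencode(
        {"key": key, "clientuuid": clientuuid, "serveruuid": serveruuid}
    )
    return f"{base.rstrip('/')}/git-annex/v3/checkpresent?{query}"


def is_present(response_body):
    """Interpret the JSON response; only a literal true counts as present."""
    return json.loads(response_body).get("present") is True
```

Treating anything other than a literal `true` as "not present" is a choice made for this sketch, not something the draft specifies.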

 ### lockcontent

@@ -106,24 +106,22 @@ Remove a key's content from the server.
 Example:

    > POST /git-annex/v3/remove?key=SHA1--foo&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.1
-    < SUCCESS
+    < {"removed": true}

 There is one required additional parameter, `key`.

 The body of the request is empty.

-The server responds with "SUCCESS" if the key was removed,
-or "FAILURE" if the key was not able to be removed.
+The server responds with a JSON object with a "removed" field that is true
+if the key was removed (or was not present on the server), 
+or false if the key was not able to be removed.

-The server can also respond with "SUCCESS-PLUS" or "FAILURE-PLUS".
-Each has a subsequent list of UUIDs of repositories
-that the content was removed from. For example:
+The JSON object can have an additional field "plusuuids" that is a list of
+UUIDs of other repositories that the content was removed from.

-    SUCCESS-PLUS 702ce472-38a1-11ef-864f-23851a2edf71 707dea20-38a1-11ef-96a4-fb7e8c8369f0
-
-If the server was prevented from trying to remove the key due to a policy
-(eg due to being read-only or append-only, it will respond with "ERROR",
-followed by a space and an error message.
+If the server does not allow removing the key due to a policy
+(eg due to being read-only or append-only), it will respond with a JSON
+object with an "error" field that has an error message as its value.
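One way a client might interpret the three shapes of `remove` response described above ("removed", optional "plusuuids", or "error") is sketched below in Python. `RemoveResult`, `PolicyError`, and `parse_remove_response` are hypothetical names introduced for illustration; the draft only defines the JSON fields.

```python
import json
from dataclasses import dataclass, field


class PolicyError(Exception):
    """Raised when the server's JSON response carries an "error" field."""


@dataclass
class RemoveResult:
    removed: bool
    plusuuids: list = field(default_factory=list)


def parse_remove_response(body):
    """Parse a remove response body into a RemoveResult."""
    obj = json.loads(body)
    if "error" in obj:
        # Policy refusal, eg a read-only or append-only server.
        raise PolicyError(obj["error"])
    return RemoveResult(
        removed=obj.get("removed") is True,
        plusuuids=list(obj.get("plusuuids", [])),
    )
```

A missing "plusuuids" field is treated here as an empty list, which is an assumption of the sketch.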

 ## remove-before

@@ -132,13 +130,15 @@ Remove a key's content from the server, but only before a specified time.
 Example:

    > POST /git-annex/v3/remove-before?timestamp=4949292929&key=SHA1--foo&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.1
-    < SUCCESS
+    < {"removed": true}

 This is the same as the `remove` request, but with an additional parameter,
 `timestamp`.

 If the server's monotonic clock is past the specified timestamp, the
-removal will fail. This is used to avoid removing content after a point in
+removal will fail and the server will respond with: `{"removed": false}`
+
+This is used to avoid removing content after a point in 
 time where it is no longer locked in other repositories.
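The server-side check for `remove-before` could look roughly like the Python sketch below. `handle_remove_before` and its `remove` callback are hypothetical, and `time.monotonic()` merely stands in for whatever monotonic clock the server shares with its peers. Allowing removal when the clock exactly equals the timestamp is also an assumption, since the text only says removal fails once the clock is "past" it.

```python
import time


def handle_remove_before(timestamp, remove, now=None):
    """Run `remove` only if the monotonic clock is not past `timestamp`.

    `remove` is an assumed callback returning True on successful removal.
    `now` can be supplied for testing; otherwise time.monotonic() is used.
    """
    if now is None:
        now = time.monotonic()
    if now > timestamp:
        # Too late: the content may no longer be locked elsewhere.
        return {"removed": False}
    return {"removed": bool(remove())}
```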

 ## gettimestamp
@@ -148,12 +148,12 @@ Gets the current timestamp from the server.
 Example:

    > POST /git-annex/v3/gettimestamp?clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.1
-    < TIMESTAMP 59459392
+    < {"timestamp": 59459392}

 The body of the request is empty.

-The server responds with "TIMESTAMP" followed by a space and the current
-value of its monotonic clock, as a number of seconds.
+The server responds with a JSON object with a "timestamp" field that has the
+current value of its monotonic clock, as a number of seconds.

 Important: If multiple servers are serving this protocol for the same
 repository, they MUST all use the same monotonic clock.
@@ -166,13 +166,14 @@ Example:

    > POST /git-annex/v3/put?key=SHA1--foo&associatedfile=bar&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.1
    > Content-Type: application/octet-stream
-    > Content-Length: 4
-    > foo1
-    < SUCCESS
+    > Content-Length: 20
+    > foo
+    > {"valid": true}
+    < {"stored": true}

 There is one required additional parameter, `key`.

-There is are also these optional parameters:
+There are also these optional parameters:

 * `associatedfile`

@@ -186,28 +187,27 @@ There is are also these optional parameters:

 The body of the request is the content of the key, starting from the
 specified offset or from the beginning. After the content of the key,
-there is one more byte. 
+there is a newline, followed by a JSON object.

-The additional byte is "1" to indicate that the content was not changed
-while it was being sent, or "0" to indicate that modified content was sent
-and should be disregarded by the server. (This corresponds
+The JSON object has a field "valid" that is true when the content 
+was not changed while it was being sent, or false when modified
+content was sent and should be disregarded by the server. (This corresponds
 to the `VALID` and `INVALID` messages in the P2P protocol.)

 The `Content-Type` header should be `application/octet-stream`.

 The `Content-Length` header should be set to the length of the body.

-The server responds with `SUCCESS` if it received the data and stored the
-content. If it was unable to do so, it responds with `FAILURE`.
+The server responds with a JSON object with a field "stored"
+that is true if it received the data and stored the
+content, or false if it was unable to do so.

-The server can also reply with `SUCCESS-PLUS`, which has a subsequent list of
-UUIDs of repositories that the content was stored to. For example:
+The JSON object can have an additional field "plusuuids" that is a list of
+UUIDs of other repositories that the content was stored to.

-    SUCCESS-PLUS 702ce472-38a1-11ef-864f-23851a2edf71 707dea20-38a1-11ef-96a4-fb7e8c8369f0
-
-If the server was prevented from storing the key due to a policy
-(eg due to being read-only), it will respond with "ERROR", followed
-by a space and an error message.
+If the server does not allow storing the key due to a policy
+(eg due to being read-only or append-only), it will respond with a JSON
+object with an "error" field that has an error message as its value.
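The `put` body framing described above (key content, then a newline, then a JSON object with a "valid" field) might be assembled as in this Python sketch. `build_put_body` is a hypothetical helper. The sketch does not append a newline after the JSON object, because the text does not say whether one is expected.

```python
import json


def build_put_body(content: bytes, valid: bool) -> bytes:
    """Frame key content for a put request: content, newline, JSON trailer."""
    trailer = json.dumps({"valid": valid}).encode("ascii")
    return content + b"\n" + trailer


body = build_put_body(b"foo", True)
headers = {
    "Content-Type": "application/octet-stream",
    # Content-Length covers the whole body, trailer included.
    "Content-Length": str(len(body)),
}
```

Because the trailer comes last, a client that notices the file changing mid-transfer can still finish the request and mark it with `valid` set to false.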

 ### putoffset

@@ -220,17 +220,18 @@ the `put` request failing.
 Example:

    > POST /git-annex/v3/putoffset?key=SHA1--foo&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.1
-    < 10
+    < {"offset": 10}

 There is one required additional parameter, `key`.

 The body of the request is empty.

-The server responds with the largest allowable offset.
+The server responds with a JSON object with an "offset" field that 
+is the largest allowable offset.

-If the server was prevented from storing the key due to a policy
-(eg due to being read-only), it will respond with "ERROR", followed
-by a space and an error message.
+If the server does not allow storing the key due to a policy
+(eg due to being read-only or append-only), it will respond with a JSON
+object with an "error" field that has an error message as its value.

 [Implementation note: This will be implemented by sending `PUT` and
 returning the `PUT-FROM` offset. To avoid leaving the P2P protocol stuck
@@ -246,8 +247,9 @@ Example:

    > POST /git-annex/v3/get?key=SHA1--foo&associatedfile=bar&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.1
    < Content-Type: application/octet-stream
-    < Content-Length: 4
-    < foo1
+    < Content-Length: 20
+    < foo
+    < {"valid": true}

 There is one required additional parameter, `key`.

@@ -271,17 +273,14 @@ The server's response will have a `Content-Length` header
 set to the length of the body.

 The server's response body is the content of the key, from the specified
-offset. After the content of the key, there is one more byte. 
+offset. After the content of the key, there is a newline, followed by a
+JSON object.

-The additional byte is "1" to indicate that the content was not changed
-while it was being sent, or "0" to indicate that modified content was sent
-and should be discarded by the client. (This corresponds
-to the `VALID` and `INVALID` messages in the P2P protocol.)
-
-Note that, if the server is not able to send the content of the requested
-key, its response body will consist of "0", eg 0 bytes of content which is
-not valid. On the other hand, a response body of "1" is used for an empty
-key which is valid.
+The JSON object has a field "valid" that is true when the content 
+was not changed while it was being sent, or false when whatever
+content was sent is not the actual content of the key and should be
+disregarded. (This corresponds to the `VALID` and `INVALID` messages
+in the P2P protocol.)
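On the receiving side, a client could split a `get` response body back into content and validity roughly as follows. This Python sketch uses a hypothetical `split_get_body` helper and assumes the JSON object sits on a single line at the end of the body, optionally followed by one newline. Since key content may itself contain newlines, the split is made at the last newline.

```python
import json


def split_get_body(body: bytes):
    """Return (content, valid) from a framed get response body."""
    # Tolerate one trailing newline after the JSON object (an assumption).
    framed = body[:-1] if body.endswith(b"\n") else body
    content, sep, trailer = framed.rpartition(b"\n")
    if not sep:
        raise ValueError("no newline separating content from JSON trailer")
    valid = json.loads(trailer).get("valid") is True
    return content, valid
```

When `valid` is false, the caller should discard `content`, matching the `INVALID` case in the P2P protocol.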

 ## simple HTTP GET

@@ -301,6 +300,6 @@ the content of a key.
 this HTTP protocol to support it.

 `CONNECT` is not supported, and due to the bi-directional message passing
-nature of it, it cannot easily be done over HTTP. It should not be
-necessary anyway, because the git repository itself can be accessed over
-HTTP.
+nature of it, it cannot easily be done over HTTP (would need websockets).
+It should not be necessary anyway, because the git repository itself can be
+accessed over HTTP.