Joey Hess 2024-10-22 11:09:47 -04:00
parent 8baccda98f
commit 7dde035ac8
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
2 changed files with 42 additions and 29 deletions


@@ -687,7 +687,7 @@ provides to the client.
An example use case involves
[presigned S3 urls](https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-presigned-url.html).
When the proxy is to a S3 bucket, having the client upload
directly to S3 would avoid needing double traffic through the proxy's
network.
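
To make this concrete, here is a minimal sketch of generating a presigned
S3 PUT url (Python with boto3; the bucket and key names are placeholders,
and this is not how git-annex itself would necessarily do it):

    # Sketch: generate a presigned url that lets a client PUT an object
    # directly to S3. Bucket and key are placeholder values.
    import boto3

    s3 = boto3.client("s3")
    url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": "example-bucket", "Key": "SHA256E-s1048576--abc123"},
        ExpiresIn=3600,  # url stays valid for one hour
    )
    # The proxy could hand this url to the client, which then uploads the
    # object content directly to S3, bypassing the proxy's network.
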
@@ -695,15 +695,33 @@ This would need a special remote that generates the presigned S3 url.
Probably an external, so the external special remote protocol would need to
be updated as well as the P2P protocol.

Since an upload to a cluster can be distributed to multiple nodes, should
it be able to indicate more than one url that the client
should upload to? Also the cluster might want an upload to still be sent to
it in addition to url(s). Of course the downside is that the client would
need to upload more than once, which eliminates one benefit of the cluster.

> Seems reasonable to only allow this to specify 1 url for the client to
> upload to. If a cluster has several remotes that can use urls, it would
> need to pick 1, or it would need to have the client upload to it, and
> distribute it to the multiple nodes.

Is a url alone enough for the client to be able to upload wherever it is
directed? The HTTP verb may also be necessary: consider POST vs PUT. Some
services might need additional HTTP headers.
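
As a rough sketch of what the client side could look like if the proxy were
to supply a verb and extra headers along with the url (all hypothetical;
nothing like this is in the P2P protocol yet):

    # Sketch: upload using whatever verb and headers the proxy specified.
    import requests

    def upload(url, verb, headers, path):
        # Stream the object file as the request body.
        with open(path, "rb") as f:
            resp = requests.request(verb, url, data=f, headers=headers)
        resp.raise_for_status()

    # A presigned S3 url needs PUT; another service might need POST or
    # additional headers to be sent.
    # upload(presigned_url, "PUT", {}, "objectfile")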

S3 can optionally verify an upload made with a presigned url by using
the Content-MD5 header. The right md5 would not be known when generating a
presigned url, unless the key happened to be an md5 key. The client could
hash the content and fill in an md5 in a template. The added complexity in
this particular case does not seem likely to be worthwhile; git-annex does
not usually have S3 verify the checksum.
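
For illustration only, a sketch of what signing Content-MD5 into a
presigned url would look like with boto3, assuming the md5 is already
known from the key:

    # Sketch: sign the Content-MD5 header into the presigned url, so S3
    # rejects an upload whose content does not match. The md5 here is
    # assumed to come from the key itself (eg an md5 key), since the
    # content is not available when the url is generated.
    import base64
    import boto3

    md5_hex = "d41d8cd98f00b204e9800998ecf8427e"  # hypothetical, from the key
    md5_b64 = base64.b64encode(bytes.fromhex(md5_hex)).decode()

    s3 = boto3.client("s3")
    url = s3.generate_presigned_url(
        "put_object",
        Params={
            "Bucket": "example-bucket",
            "Key": "example-object",
            "ContentMD5": md5_b64,
        },
        ExpiresIn=3600,
    )
    # The client's PUT must send the same Content-MD5 header, otherwise
    # S3 refuses the upload.
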
S3 also supports using POST from a web browser, which is similar to a
presigned url:
<https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-UsingHTTPPOST.html>
This does have a bunch of headers but also uses `multipart/form-data`,
so just dumping the file into the body won't work.
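
For comparison, a sketch of that POST flavor using boto3's
generate_presigned_post, which returns a url plus form fields that have to
be sent as multipart/form-data along with the file (names are placeholders):

    # Sketch: browser-style POST upload. The returned fields must be
    # included in the multipart/form-data body, with the file last, so
    # simply PUTting the raw file body would not work here.
    import boto3
    import requests

    s3 = boto3.client("s3")
    post = s3.generate_presigned_post(
        Bucket="example-bucket",
        Key="example-object",
        ExpiresIn=3600,
    )
    with open("objectfile", "rb") as f:
        requests.post(post["url"], data=post["fields"], files={"file": f})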

Seems unnecessary to support, since javascript should be able to access
the file that the user has selected to upload, and PUT its content to the
presigned url.


@@ -26,13 +26,23 @@ Planned schedule of work:
[[!tag projects/openneuro]]
## remaining things to do

* Streaming uploads to special remotes via the proxy. Possibly, if a
workable design can be developed. It seems difficult without changing the
external special remote protocol, unless a fifo is used (see the sketch
after this list). Make ORDERED response in p2p protocol allow using a fifo?
* Indirect uploads when proxying for special remote is an alternative that
would work for OpenNeuro's use case.
* If not implementing upload streaming to proxied special remotes,
this needs to be addressed:
When an upload to a cluster is distributed to multiple special remotes,
a temporary file is written for each one, which may even happen in
parallel. This is a lot of extra work and may use excess disk space.
It should be possible to only write a single temp file.
(With streaming this wouldn't be an issue.)
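
To make the fifo idea in the first item above a bit more concrete, a very
rough sketch (the names and wiring are purely illustrative, not the actual
git-annex implementation):

    # Sketch: the proxy writes the streamed upload into a named pipe
    # while the special remote's transfer reads the same path as if it
    # were an ordinary file, so no temp file is needed.
    import os
    import threading

    def stream_via_fifo(chunks, store_from_file):
        fifo = "/tmp/annex-upload-fifo"  # hypothetical path
        os.mkfifo(fifo)
        try:
            # Start the special remote's transfer on the fifo path...
            t = threading.Thread(target=store_from_file, args=(fifo,))
            t.start()
            # ...and feed it the data as it arrives from the client.
            # (Opening the fifo for writing blocks until the reader opens it.)
            with open(fifo, "wb") as pipe:
                for chunk in chunks:
                    pipe.write(chunk)
            t.join()
        finally:
            os.remove(fifo)
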
## completed items for October's work on streaming through proxy to special remotes
@@ -146,22 +156,9 @@ Planned schedule of work:
* Resuming an interrupted download from proxied special remote makes the proxy
re-download the whole content. It could instead keep some of the
object files around when the client does not send SUCCESS. This would
use more disk, but could minimize to eg, the last 2 or so.
The design doc has some more thoughts about this.
* Getting a key from a cluster currently picks from among
the lowest cost remotes at random. This could be smarter,
eg prefer to avoid using remotes that are doing other transfers at the
@@ -179,8 +176,6 @@ Planned schedule of work:
If seriously tackling this, it might be worth making enough information
available to use spanning tree protocol for routing inside clusters.
* Optimise proxy speed. See design for ideas.
* Speed: A proxy to a local git repository spawns git-annex-shell
to communicate with it. It would be more efficient to operate
directly on the Remote. Especially when transferring content to/from it.