Joey Hess 2024-10-22 11:09:47 -04:00
parent 8baccda98f
commit 7dde035ac8
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
2 changed files with 42 additions and 29 deletions


@@ -687,7 +687,7 @@ provides to the client.
An example use case involves
[presigned S3 urls](https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-presigned-url.html).
When the proxy is to a S3 bucket, having the client upload
directly to S3 would avoid needing double traffic through the proxy's
network.
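
To make this concrete, here is a minimal sketch of generating a presigned
S3 PUT url (Python with boto3; the bucket and key names are placeholders,
and this is not how git-annex itself would necessarily do it):

    # Sketch: generate a presigned url that lets a client PUT an object
    # directly to S3. Bucket and key are placeholder values.
    import boto3

    s3 = boto3.client("s3")
    url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": "example-bucket", "Key": "SHA256E-s1048576--abc123"},
        ExpiresIn=3600,  # url stays valid for one hour
    )
    # The proxy could hand this url to the client, which then uploads the
    # object content directly to S3, bypassing the proxy's network.
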
@@ -695,15 +695,33 @@ This would need a special remote that generates the presigned S3 url.
Probably an external, so the external special remote protocol would need to
be updated as well as the P2P protocol.

Since an upload to a cluster can be distributed to multiple nodes, should
it be able to indicate more than one url that the client
should upload to? Also the cluster might want an upload to still be sent to
it in addition to url(s). Of course the downside is that the client would
need to upload more than once, which eliminates one benefit of the cluster.

> Seems reasonable to only allow this to specify 1 url for the client to
> upload to. If a cluster has several remotes that can use urls, it would
> need to pick 1, or it would need to have the client upload to it, and
> distribute it to the multiple nodes.

Is a url alone enough for the client to be able to upload wherever it is
directed? The HTTP verb may also be necessary: consider POST vs PUT. Some
services might need additional HTTP headers.
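
As a rough sketch of what the client side could look like if the proxy were
to supply a verb and extra headers along with the url (all hypothetical;
nothing like this is in the P2P protocol yet):

    # Sketch: upload using whatever verb and headers the proxy specified.
    import requests

    def upload(url, verb, headers, path):
        # Stream the object file as the request body.
        with open(path, "rb") as f:
            resp = requests.request(verb, url, data=f, headers=headers)
        resp.raise_for_status()

    # A presigned S3 url needs PUT; another service might need POST or
    # additional headers to be sent.
    # upload(presigned_url, "PUT", {}, "objectfile")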

S3 can optionally verify an upload made with a presigned url by using
the Content-MD5 header. The right md5 would not be known when generating a
presigned url, unless the key happened to be an md5 key. The client could
hash the content and fill in an md5 in a template. The added complexity in
this particular case does not seem likely to be worthwhile; git-annex does
not usually have S3 verify the checksum.
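
For illustration only, a sketch of what signing Content-MD5 into a
presigned url would look like with boto3, assuming the md5 is already
known from the key:

    # Sketch: sign the Content-MD5 header into the presigned url, so S3
    # rejects an upload whose content does not match. The md5 here is
    # assumed to come from the key itself (eg an md5 key), since the
    # content is not available when the url is generated.
    import base64
    import boto3

    md5_hex = "d41d8cd98f00b204e9800998ecf8427e"  # hypothetical, from the key
    md5_b64 = base64.b64encode(bytes.fromhex(md5_hex)).decode()

    s3 = boto3.client("s3")
    url = s3.generate_presigned_url(
        "put_object",
        Params={
            "Bucket": "example-bucket",
            "Key": "example-object",
            "ContentMD5": md5_b64,
        },
        ExpiresIn=3600,
    )
    # The client's PUT must send the same Content-MD5 header, otherwise
    # S3 refuses the upload.
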
S3 also supports using POST from a web browser, which is similar to a
presigned url:
<https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-UsingHTTPPOST.html>
This does have a bunch of headers but also uses `multipart/form-data`,
so just dumping the file into the body won't work.
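
For comparison, a sketch of that POST flavor using boto3's
generate_presigned_post, which returns a url plus form fields that have to
be sent as multipart/form-data along with the file (names are placeholders):

    # Sketch: browser-style POST upload. The returned fields must be
    # included in the multipart/form-data body, with the file last, so
    # simply PUTting the raw file body would not work here.
    import boto3
    import requests

    s3 = boto3.client("s3")
    post = s3.generate_presigned_post(
        Bucket="example-bucket",
        Key="example-object",
        ExpiresIn=3600,
    )
    with open("objectfile", "rb") as f:
        requests.post(post["url"], data=post["fields"], files={"file": f})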

Seems unnecessary to support, since javascript should be able to access
the file that the user has selected to upload, and PUT its content to the
presigned url.


@@ -26,13 +26,23 @@ Planned schedule of work:
[[!tag projects/openneuro]]
## remaining things to do

* Streaming uploads to special remotes via the proxy. Possibly, if a
workable design can be developed. It seems difficult without changing the
external special remote protocol, unless a fifo is used (see the sketch
after this list). Make ORDERED response in p2p protocol allow using a fifo?
* Indirect uploads when proxying for special remote is an alternative that
would work for OpenNeuro's use case.
* If not implementing upload streaming to proxied special remotes,
this needs to be addressed:
When an upload to a cluster is distributed to multiple special remotes,
a temporary file is written for each one, which may even happen in
parallel. This is a lot of extra work and may use excess disk space.
It should be possible to only write a single temp file.
(With streaming this wouldn't be an issue.)
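
To make the fifo idea in the first item above a bit more concrete, a very
rough sketch (the names and wiring are purely illustrative, not the actual
git-annex implementation):

    # Sketch: the proxy writes the streamed upload into a named pipe
    # while the special remote's transfer reads the same path as if it
    # were an ordinary file, so no temp file is needed.
    import os
    import threading

    def stream_via_fifo(chunks, store_from_file):
        fifo = "/tmp/annex-upload-fifo"  # hypothetical path
        os.mkfifo(fifo)
        try:
            # Start the special remote's transfer on the fifo path...
            t = threading.Thread(target=store_from_file, args=(fifo,))
            t.start()
            # ...and feed it the data as it arrives from the client.
            # (Opening the fifo for writing blocks until the reader opens it.)
            with open(fifo, "wb") as pipe:
                for chunk in chunks:
                    pipe.write(chunk)
            t.join()
        finally:
            os.remove(fifo)
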
## completed items for October's work on streaming through proxy to special remotes
@@ -146,22 +156,9 @@ Planned schedule of work:
* Resuming an interrupted download from proxied special remote makes the proxy
re-download the whole content. It could instead keep some of the
object files around when the client does not send SUCCESS. This would
use more disk, but could minimize to eg, the last 2 or so.
The design doc has some more thoughts about this.
* Getting a key from a cluster currently picks from among
the lowest cost remotes at random. This could be smarter,
eg prefer to avoid using remotes that are doing other transfers at the
@@ -179,8 +176,6 @@ Planned schedule of work:
If seriously tackling this, it might be worth making enough information
available to use spanning tree protocol for routing inside clusters.
* Optimise proxy speed. See design for ideas.
* Speed: A proxy to a local git repository spawns git-annex-shell
to communicate with it. It would be more efficient to operate
directly on the Remote. Especially when transferring content to/from it.