planning
parent
8baccda98f
commit
7dde035ac8
2 changed files with 42 additions and 29 deletions
@@ -687,7 +687,7 @@ provides to the client.

An example use case involves
[presigned S3 urls](https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-presigned-url.html).
When one of the proxy's nodes is an S3 bucket, having the client upload
When the proxy is to an S3 bucket, having the client upload
directly to S3 would avoid needing double traffic through the proxy's
network.

@@ -695,15 +695,33 @@ This would need a special remote that generates the presigned S3 url.
Probably an external, so the external special remote protocol would need to
be updated as well as the P2P protocol.

Since an upload to a proxy can be distributed to multiple nodes, should
the proxy be able to indicate more than one url that the client
should upload to? Also the proxy might want an upload to still be sent to
Since an upload to a cluster can be distributed to multiple nodes, should
it be able to indicate more than one url that the client
should upload to? Also the cluster might want an upload to still be sent to
it in addition to url(s). Of course the downside is that the client would
need to upload more than once, which eliminates one benefit of the proxy.
So it might be reasonable to only support one url, but what if the proxy
has multiple remotes that want to provide urls, how does it pick which one
wins?
need to upload more than once, which eliminates one benefit of the cluster.

> Seems reasonable to only allow this to specify 1 url for the client to
> upload to. If a cluster has several remotes that can use urls, it would
> need to pick 1, or it would need to have the client upload to it, and
> distribute it to the multiple nodes.

Is only a URL enough for the client to be able to upload to wherever? It
may be that the HTTP verb is also necessary. Consider POST vs PUT. Some
services might need additional HTTP headers.
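
To make the question about verbs and headers concrete, here is a minimal sketch of what the client might need to be told. `UploadTarget` and `build_upload_request` are hypothetical names, not part of any git-annex protocol, and the url is made up. The request is only built, never sent.

```python
from dataclasses import dataclass, field
from urllib.request import Request


@dataclass(frozen=True)
class UploadTarget:
    """Hypothetical description of where and how a client should upload."""
    url: str
    method: str = "PUT"  # some services may expect POST instead
    headers: dict = field(default_factory=dict)


def build_upload_request(target: UploadTarget, body: bytes) -> Request:
    """Build (but do not send) the HTTP request described by the target."""
    return Request(target.url, data=body, method=target.method,
                   headers=dict(target.headers))


# Example with a made-up url; nothing goes over the network here.
example = UploadTarget(
    url="https://bucket.example.com/key?X-Amz-Signature=placeholder",
    headers={"Content-Type": "application/octet-stream"},
)
request = build_upload_request(example, b"file content")
```

If only a url were sent, `method` and `headers` would have to be guessed by the client, which is the concern raised above.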

S3 can optionally verify the upload of a presigned url by using
the Content-MD5 header. The right md5 would not be known when generating a
presigned url, unless the key happened to be an md5 key. The client could
hash the content and fill in an md5 in a template. Added complexity in this
particular case does not seem likely to be worthwhile. git-annex does not
usually have S3 verify the checksum.
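
For illustration only, the "fill in an md5 in a template" idea could look roughly like this. The Content-MD5 value is the base64 encoding of the raw MD5 digest; the `{md5}` placeholder syntax and the function names are assumptions made up for this sketch.

```python
import base64
import hashlib


def content_md5(data: bytes) -> str:
    """Return a Content-MD5 header value: base64 of the raw MD5 digest."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def fill_headers(template: dict, data: bytes) -> dict:
    """Replace a hypothetical "{md5}" placeholder in each header value."""
    md5 = content_md5(data)
    return {name: value.replace("{md5}", md5)
            for name, value in template.items()}


headers = fill_headers({"Content-MD5": "{md5}"}, b"hello")
```

This hashes content already held in memory; a real client would hash the file incrementally before starting the upload, which is part of the added complexity.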

S3 also supports using POST from a web browser, which is similar to a
presigned url:
<https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-UsingHTTPPOST.html>
This does have a bunch of headers but also uses `multipart/form-data`,
so just dumping the file into the body won't work.
Seems unnecessary to support since javascript should be able to access
the file that the user has selected to upload, and PUT its content to the
presigned url.

@@ -26,13 +26,23 @@ Planned schedule of work:

[[!tag projects/openneuro]]

## work notes
## remaining things to do

* Currently working on streaming special remotes via proxy
in the `streamproxy` branch.
* Streaming uploads to special remotes via the proxy. Possibly; if a
workable design can be developed. It seems difficult without changing the
external special remote protocol, unless a fifo is used. Make ORDERED
response in p2p protocol allow using a fifo?

* Downloads from special remotes can stream (though using a temp file on
the proxy). Next: Streaming uploads via the proxy.
* Indirect uploads when proxying for special remote are an alternative that
would work for OpenNeuro's use case.

* If not implementing upload streaming to proxied special remotes,
this needs to be addressed:
When an upload to a cluster is distributed to multiple special remotes,
a temporary file is written for each one, which may even happen in
parallel. This is a lot of extra work and may use excess disk space.
It should be possible to only write a single temp file.
(With streaming this wouldn't be an issue.)
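
A rough sketch of the single temp file idea, under stated assumptions: `distribute_upload` and the `StoreFn` callback are hypothetical stand-ins for however the proxy hands a file to a special remote, the remotes are served one after another rather than in parallel, and a POSIX system is assumed so the temp file can be reopened by name.

```python
import shutil
import tempfile
from typing import BinaryIO, Callable, Iterable

# Hypothetical stand-in for "store this file on a special remote".
StoreFn = Callable[[str], None]


def distribute_upload(incoming: BinaryIO, remotes: Iterable[StoreFn]) -> int:
    """Spool the upload to one temp file, then store it on each remote.

    Returns the number of remotes the file was handed to.
    """
    count = 0
    with tempfile.NamedTemporaryFile() as tmp:
        shutil.copyfileobj(incoming, tmp)  # written once, however many remotes
        tmp.flush()
        for store in remotes:
            store(tmp.name)  # every remote reads the same file
            count += 1
    return count
```

Disk use stays at one copy of the content regardless of how many remotes the cluster distributes to.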

## completed items for October's work on streaming through proxy to special remotes

@@ -146,22 +156,9 @@ Planned schedule of work:
* Resuming an interrupted download from proxied special remote makes the proxy
re-download the whole content. It could instead keep some of the
object files around when the client does not send SUCCESS. This would
use more disk, but without streaming, proxying a special remote already
needs some disk. And it could minimize to eg, the last 2 or so.
use more disk, but could minimize to eg, the last 2 or so.
The design doc has some more thoughts about this.

* Streaming download from proxied special remotes. See design.
(Planned for September)

* When an upload to a cluster is distributed to multiple special remotes,
a temporary file is written for each one, which may even happen in
parallel. This is a lot of extra work and may use excess disk space.
It should be possible to only write a single temp file.
(With streaming this won't be an issue.)

* Indirect uploads when proxying for special remote
(to be considered). See design.

* Getting a key from a cluster currently picks from among
the lowest cost remotes at random. This could be smarter,
eg prefer to avoid using remotes that are doing other transfers at the
@@ -179,8 +176,6 @@ Planned schedule of work:
If seriously tackling this, it might be worth making enough information
available to use spanning tree protocol for routing inside clusters.

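As a sketch of the "could be smarter" idea for picking a cluster remote: restrict to the lowest cost remotes as now, then prefer the ones doing the fewest transfers, and only then choose at random. `RemoteInfo`, its `active_transfers` field, and `pick_remote` are hypothetical; how a cluster would learn about ongoing transfers is not addressed here.

```python
import random
from dataclasses import dataclass


@dataclass
class RemoteInfo:
    """Hypothetical view of a cluster node for selection purposes."""
    name: str
    cost: int
    active_transfers: int = 0


def pick_remote(remotes, rng=random):
    """Among the lowest cost remotes, prefer the least busy; ties go to chance."""
    lowest = min(r.cost for r in remotes)
    cheapest = [r for r in remotes if r.cost == lowest]
    idlest = min(r.active_transfers for r in cheapest)
    return rng.choice([r for r in cheapest if r.active_transfers == idlest])
```

With all `active_transfers` left at 0 this reduces to the current behaviour of a random pick among the lowest cost remotes. An empty list of remotes raises `ValueError`.
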
* Optimise proxy speed. See design for ideas.

* Speed: A proxy to a local git repository spawns git-annex-shell
to communicate with it. It would be more efficient to operate
directly on the Remote. Especially when transferring content to/from it.