Merge branch 's3-aws'

Joey Hess 2014-12-03 14:02:29 -04:00
commit 911ba8d972
12 changed files with 493 additions and 225 deletions


@@ -2,9 +2,13 @@ S3 has memory leaks
Sending a file to S3 causes a slow memory increase toward the file size.

> This is fixed, now that it uses aws. --[[Joey]]

Copying the file back from S3 causes a slow memory increase toward the
file size.

> [[fixed|done]] too! --[[Joey]]

The author of hS3 is aware of the problem, and working on it. I think I
have identified the root cause of the buffering; it's done by hS3 so it can
resend the data if S3 sends it a 307 redirect. --[[Joey]]


@@ -52,3 +52,11 @@ Please provide any additional information below.
upgrade supported from repository versions: 0 1 2
[[!tag confirmed]]
> [[fixed|done]] This is now supported, when git-annex is built with a new
> enough version of the aws library. You need to configure the remote to
> use an appropriate value for partsize, eg:
>
>     git annex enableremote cloud partsize=1GiB
>
> --[[Joey]]


@@ -6,3 +6,5 @@ Amazon has opened up a new region in AWS with a datacenter in Frankfurt/Germany.
* Region: eu-central-1

This should be added to the Datacenter dropdown on the webapp's "Adding an
Amazon S3 repository" page.

> [[fixed|done]] --[[Joey]]
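
Assuming the `datacenter` option accepts this region name the same way it
accepts the other region names documented for the S3 special remote, a
remote in the new region could be set up along these lines (the remote name
is illustrative):

    # create an S3 special remote in the Frankfurt region
    git annex initremote frankfurt type=S3 datacenter=eu-central-1 encryption=none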


@@ -18,11 +18,11 @@ the S3 remote.
* `encryption` - One of "none", "hybrid", "shared", or "pubkey".
See [[encryption]].
* `keyid` - Specifies the gpg key to use for [[encryption]].
* `chunk` - Enables [[chunking]] when storing large files.
`chunk=1MiB` is a good starting point for chunking.
* `embedcreds` - Optional. Set to "yes" to embed the login credentials inside
the git repository, which allows other clones to also access them. This is
the default when gpg encryption is enabled; the credentials are stored
@@ -33,7 +33,8 @@ the S3 remote.
embedcreds without gpg encryption.
* `datacenter` - Defaults to "US". Other values include "EU",
"us-west-1", "us-west-2", "ap-southeast-1", "ap-southeast-2", and
"sa-east-1".
* `storageclass` - Default is "STANDARD". If you have configured git-annex
to preserve multiple [[copies]], consider setting this to "REDUCED_REDUNDANCY"
@@ -46,11 +47,24 @@ the S3 remote.
so by default, a bucket name is chosen based on the remote name
and UUID. This can be specified to pick a bucket name.
* `partsize` - Amazon S3 only accepts uploads up to a certain file size,
and storing larger files requires a multipart upload process.
Setting `partsize=1GiB` is recommended for Amazon S3 when not using
chunking; this will cause multipart uploads to be done using parts
up to 1GiB in size. Note that setting partsize to less than 100MiB
will cause Amazon S3 to reject uploads.
This is not enabled by default, since other S3 implementations may
not support multipart uploads or have different limits, but it can be
enabled or changed at any time; see the example after this list.
* `fileprefix` - By default, git-annex places files in a tree rooted at the
top of the S3 bucket. When this is set, it's prefixed to the filenames
used. For example, you could set it to "foo/" in one special remote,
and to "bar/" in another special remote, and both special remotes could
then use the same bucket.
* `x-amz-meta-*` are passed through as http headers when storing keys
in S3.
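
Putting several of these options together, a setup command might look like
the following sketch (the remote name "cloud" and the fileprefix value are
illustrative; the option names and values are taken from the list above):

    # S3 special remote with shared encryption, multipart uploads done
    # in 1GiB parts, and a filename prefix so the bucket can be shared
    git annex initremote cloud type=S3 encryption=shared \
        datacenter=us-west-2 partsize=1GiB fileprefix=cloud/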


@@ -0,0 +1,14 @@
When a multipart S3 upload gets interrupted, the parts remain in the
bucket, and S3 may charge for them.

I am not sure what happens if the same object gets uploaded again. Is S3
nice enough to remove the old parts? I need to find out.

If not, this needs to be dealt with somehow. One way would be to configure
an expiry of the uploaded parts, but this is tricky as a huge upload could
take arbitrarily long. Another way would be to record the uploadid and the
etags of the parts, and then resume where it left off the next time the
object is sent to S3. (Or at least cancel the old upload; resume isn't
practical when uploading an encrypted object.)

It could store that info in either the local FS or the git-annex branch.
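
For reference, stale parts can at least be inspected and cleaned up by hand
with Amazon's command-line tools; the bucket, key, and upload id below are
placeholders:

    # list multipart uploads that were started but never completed
    aws s3api list-multipart-uploads --bucket mybucket

    # abort one; S3 then deletes the parts it stored for that upload
    aws s3api abort-multipart-upload --bucket mybucket \
        --key mykey --upload-id EXAMPLE_UPLOAD_ID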