Merge branch 's3-aws'

Joey Hess 2014-12-03 14:02:29 -04:00
commit 911ba8d972
12 changed files with 493 additions and 225 deletions


@@ -2,9 +2,13 @@ S3 has memory leaks
Sending a file to S3 causes a slow memory increase toward the file size.

> This is fixed, now that it uses aws. --[[Joey]]

Copying the file back from S3 causes a slow memory increase toward the
file size.

> [[fixed|done]] too! --[[Joey]]

The author of hS3 is aware of the problem, and working on it. I think I
have identified the root cause of the buffering; it's done by hS3 so it can
resend the data if S3 sends it a 307 redirect. --[[Joey]]


@@ -52,3 +52,11 @@ Please provide any additional information below.
upgrade supported from repository versions: 0 1 2
[[!tag confirmed]]
> [[fixed|done]] This is now supported, when git-annex is built with a new
> enough version of the aws library. You need to configure the remote to
> use an appropriate value for partsize, eg:
>
>     git annex enableremote cloud partsize=1GiB
>
> --[[Joey]]


@@ -6,3 +6,5 @@ Amazon has opened up a new region in AWS with a datacenter in Frankfurt/Germany.
* Region: eu-central-1

This should be added to the Datacenter dropdown on the webapp's "Adding an
Amazon S3 repository" page.

> [[fixed|done]] --[[Joey]]
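
Assuming the `datacenter` option accepts this region name the same way it
accepts the other region names documented for the S3 special remote, a
remote in the new region could be set up along these lines (the remote name
is illustrative):

    # create an S3 special remote in the Frankfurt region
    git annex initremote frankfurt type=S3 datacenter=eu-central-1 encryption=none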


@@ -18,11 +18,11 @@ the S3 remote.
* `encryption` - One of "none", "hybrid", "shared", or "pubkey".
See [[encryption]].
* `keyid` - Specifies the gpg key to use for [[encryption]].
* `chunk` - Enables [[chunking]] when storing large files.
`chunk=1MiB` is a good starting point for chunking.
* `embedcreds` - Optional. Set to "yes" to embed the login credentials inside
the git repository, which allows other clones to also access them. This is
the default when gpg encryption is enabled; the credentials are stored
@@ -33,7 +33,8 @@ the S3 remote.
embedcreds without gpg encryption.
* `datacenter` - Defaults to "US". Other values include "EU",
"us-west-1", "us-west-2", "ap-southeast-1", "ap-southeast-2", and
"sa-east-1".
* `storageclass` - Default is "STANDARD". If you have configured git-annex
to preserve multiple [[copies]], consider setting this to "REDUCED_REDUNDANCY"
@@ -46,11 +47,24 @@ the S3 remote.
so by default, a bucket name is chosen based on the remote name
and UUID. This can be specified to pick a bucket name.
* `partsize` - Amazon S3 only accepts uploads up to a certain file size,
and storing larger files requires a multipart upload process.
Setting `partsize=1GiB` is recommended for Amazon S3 when not using
chunking; this will cause multipart uploads to be done using parts
up to 1GiB in size. Note that setting partsize to less than 100MiB
will cause Amazon S3 to reject uploads.
This is not enabled by default, since other S3 implementations may
not support multipart uploads or have different limits, but it can be
enabled or changed at any time; see the example after this list.
* `fileprefix` - By default, git-annex places files in a tree rooted at the
top of the S3 bucket. When this is set, it's prefixed to the filenames
used. For example, you could set it to "foo/" in one special remote,
and to "bar/" in another special remote, and both special remotes could
then use the same bucket.
* `x-amz-meta-*` are passed through as http headers when storing keys
in S3.
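
Putting several of these options together, a setup command might look like
the following sketch (the remote name "cloud" and the fileprefix value are
illustrative; the option names and values are taken from the list above):

    # S3 special remote with shared encryption, multipart uploads done
    # in 1GiB parts, and a filename prefix so the bucket can be shared
    git annex initremote cloud type=S3 encryption=shared \
        datacenter=us-west-2 partsize=1GiB fileprefix=cloud/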


@@ -0,0 +1,14 @@
When a multipart S3 upload gets interrupted, the parts remain in the
bucket, and S3 may charge for them.

I am not sure what happens if the same object gets uploaded again. Is S3
nice enough to remove the old parts? I need to find out.

If not, this needs to be dealt with somehow. One way would be to configure
an expiry of the uploaded parts, but this is tricky as a huge upload could
take arbitrarily long. Another way would be to record the uploadid and the
etags of the parts, and then resume where it left off the next time the
object is sent to S3. (Or at least cancel the old upload; resume isn't
practical when uploading an encrypted object.)

It could store that info in either the local FS or the git-annex branch.
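
For reference, stale parts can at least be inspected and cleaned up by hand
with Amazon's command-line tools; the bucket, key, and upload id below are
placeholders:

    # list multipart uploads that were started but never completed
    aws s3api list-multipart-uploads --bucket mybucket

    # abort one; S3 then deletes the parts it stored for that upload
    aws s3api abort-multipart-upload --bucket mybucket \
        --key mykey --upload-id EXAMPLE_UPLOAD_ID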