revamp s3 design

looking very doable now
2011-03-27 17:45:11 -04:00 · fb47e88404
commit fb47e88404
parent f8693facab
1 changed file with 25 additions and 78 deletions
--- a/doc/todo/S3.mdwn
+++ b/doc/todo/S3.mdwn
@@ -2,90 +2,37 @@ Support Amazon S3 as a file storage backend.
 There's a haskell library that looks good. Not yet in Debian.
-Multiple ways of using S3 are possible. Current plan is to have a S3BUCKET
+Multiple ways of using S3 are possible. Current plan is to 
-backend, that is derived from Backend.File, so it caches files locally and
+have a special type of git remote (though git won't know how to use it;
-can transfer files between systems too, without involving S3.
+only git-annex will) that uses a S3 bucket.
-get will try to get it from S3 or from a remote. An annex.s3.cost can
+Something like:
 configure the cost of S3 vs the cost of other remotes.
-add will always upload a copy to S3.
+	[remote "s3"]
 		annex-s3bucket = mybucket
 		annex-s3datacenter = Europe
 		annex-uuid = 1a586cf6-45e9-11e0-ba9c-3b0a3397aec2
 		annex-cost = 500
-Each file in the S3 bucket is assumed to be in the annex. So unused
+The UUID will be stored as a special file in the S3 bucket.
 will show files in the bucket that nothing points to, and dropunused remove
 them.
-For numcopies counting, S3 will count as 1 copy (or maybe more?), so if
+Using a different type of remote like this will allow S3 to be used
-numcopies=2, then you don't fully trust S3 and request git-annex assure
+anywhere a regular remote would be used. `git annex get` will transparently
-one other copy.
+download a file from S3 if S3 has it and is the cheapest remote.
-drop will remove a file locally, but keep it in S3. drop --force *might*
+	git annex copy --to s3
-remove it from S3. TBD.
+	git annex move --from s3
 	git annex drop --from s3 # not currently allowed, will need adding
-annex.s3.bucket would configure the bucket to use. (And an env var or
+Each s3 remote will count as one copy for numcopies handling, just like
-something configure the password.) Although the bucket
+any other remote.
 would also be encoded in the keys. So, the configured bucket would be used
 when adding new files. A system could move from one bucket to another over
 time while still having legacy files in an earlier one; 
 perhaps you move to Europe and want new files to be put in that region.
-And git annex `migrate --backend=S3BUCKET --force` could move files
+## unused checking
 between datacenters!
-Problem: Then the only way for unused to know what buckets are in use
+One problem is `git annex unused`. Currently it only looks at the local
-is to see what keys point to them -- but if the last file from a bucket is
+repository, not remotes. But if something is dropped from the local repo,
-deleted, it would then not be able to say that the files in that bucket are
+and you forget to drop it from S3, cruft can build up there.
 all unused. Need cached list of recently seen S3 buckets?
-----
+This could be fixed by adding a hook to list all keys present in a remote.
-
+Then unused could scan remotes for keys, and if they were not used locally,
-One problem with this is what key metadata to include. Should it be like
+offer the possibility to drop them from the remote.
 WORM? Or like SHA1? Or just a new unique identifier for each file? It might
 be worth having S3 variants of *all* the Backend.File derived backends.
 More blue-sky, it might be nice to be able to union or stack together
 multiple backends, so S3BUCKET+SHA1 or S3BUCKET+WORM. That would likely
 be hard to get right.
 Less blue-sky, if the S3 capability were added directly to Backend.File,
 and bucket name was configured by annex.s3.bucket, then any existing
 annexed file could be upgraded to also store on S3.
 ## alternate approach
 The above assumes S3 should be a separate backend somehow. What if,
 instead, a S3 bucket is treated as a separate **remote**?
 * Could "git annex add" while offline, and "git annex push --to S3" when
  online.
 * No need to choose whether a file goes to S3 at add time; no need to
  migrate to move files there.
 * numcopies counting Just Works
 * Could have multiple S3 buckets as desired.
 The bucket name could 1:1 map with its annex.uuid, so not much
 configuration would be needed when cloning a repo to get it using S3 --
 just configure the S3 access token(s) to use for various UUIDs.
 Implementing this might not be as conceptually nice as making S3 a separate
 backend. It would need some changes to the remotes code, perhaps lifting
 some of it into backend-specific hooks. Then the S3 backend could be
 implicitly stacked in front of a backend like WORM.
 ---
 Maybe the right way to look at this is that a list of Stores
 should be a property of the Backend. Backend.File is a Backend, that
 uses various Stores, which can be of different types (the local
 git repo, remote git repos, S3, etc). Backend.URL is a backend that uses
 other Stores (the local git repo, and the web).
 Operations on Stores are:
 * uuid -- each store has a unique uuid value
 * cost -- each store has a use cost value
 * getConfig -- attempts to look up values (uuid, possibly cost)
 * copyToStore -- store a file's contents to a key
 * copyFromStore -- retrieve a key's contents to a file
 * removeFromStore -- removes a key's contents from the store
 * hasKey -- checks if the key's content is available
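
The Store operations listed in the removed text, together with the key-listing hook proposed under "unused checking", can be sketched roughly as follows. This is an illustration only, written in Python rather than git-annex's Haskell. `MemoryStore`, `list_keys`, `cheapest_store_with` and `unused_keys` are hypothetical names, and an in-memory dict stands in for a real S3 bucket or git remote.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class MemoryStore:
    """Hypothetical in-memory stand-in for a Store (local repo, remote, S3 bucket)."""
    uuid: str   # each store has a unique uuid value
    cost: int   # each store has a use cost value
    contents: Dict[str, bytes] = field(default_factory=dict)

    def copy_to_store(self, key: str, data: bytes) -> None:
        """Store a file's contents under a key."""
        self.contents[key] = data

    def copy_from_store(self, key: str) -> bytes:
        """Retrieve a key's contents."""
        return self.contents[key]

    def remove_from_store(self, key: str) -> None:
        """Remove a key's contents from the store, if present."""
        self.contents.pop(key, None)

    def has_key(self, key: str) -> bool:
        """Check whether the key's content is available."""
        return key in self.contents

    def list_keys(self) -> List[str]:
        """Assumed hook: list all keys present in this store."""
        return sorted(self.contents)


def cheapest_store_with(key: str, stores: Iterable[MemoryStore]) -> Optional[MemoryStore]:
    """Pick the lowest-cost store that has the key, or None if no store has it."""
    candidates = [s for s in stores if s.has_key(key)]
    return min(candidates, key=lambda s: s.cost, default=None)


def unused_keys(store: MemoryStore, keys_in_use: Iterable[str]) -> List[str]:
    """Keys present in the store that nothing in the local repository uses."""
    in_use = set(keys_in_use)
    return [k for k in store.list_keys() if k not in in_use]
```

With a store of cost 500 standing in for the `s3` remote and a cheaper store that lacks a key, `cheapest_store_with` returns the S3 stand-in. Once the cheaper store also has the key, it is preferred. `unused_keys` returns whatever the remote holds that the given set of in-use keys does not mention, which is the cruft that `git annex unused` would offer to drop.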