4 designs for better chunking
Having a hard time finding a way to totally obscure file sizes, but otherwise happy with design #4. This commit was sponsored by Michael Alan Dorman.

May be a useful starting point for [[deltas]].

May also allow for downloading different chunks of a file concurrently from
multiple remotes.

# currently

Currently, only the webdav and directory special remotes support chunking.

The filenames used for the chunks make it easy to see which chunks belong
together, even when encryption is used. There is also a chunkcount file
that similarly leaks information.

It is not currently possible to enable chunking on a non-chunked remote.

Problem: Two uploads of the same key from repos with different chunk sizes
could lead to data loss. For example, suppose A is 10 mb, and B is 20 mb,
and the upload speed is the same. If B starts first, then A will overwrite
the file B is uploading for the 1st chunk. Then A uploads the second
chunk, and once A is done, B finishes the 1st chunk and uploads its second.
We now have 1(from A), 2(from B).

This needs to be supported for back-compat, so keep the chunksize= setting
to enable that mode, and add a new setting for the new mode, as sketched
below.
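
For illustration, one way the two modes might be distinguished when parsing
a remote's configuration. This is a hypothetical sketch in Haskell; "chunk"
is a made-up name for the new setting:

    -- Hypothetical sketch: "chunk" is a stand-in name for the new setting.
    data ChunkConfig
        = NoChunks
        | LegacyChunks Integer   -- old chunksize= mode, kept for back-compat
        | NewChunks Integer      -- the new mode
        deriving Show

    chunkConfig :: [(String, String)] -> ChunkConfig
    chunkConfig c = case (lookup "chunksize" c, lookup "chunk" c) of
        -- real settings would need size parsing (eg "1mb");
        -- read is enough for a sketch
        (Just v, _) -> LegacyChunks (read v)
        (_, Just v) -> NewChunks (read v)
        _           -> NoChunks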

# new requirements

Every special remote should support chunking. (It does not make sense
to support it for git remotes, but gcrypt remotes should support it.)

S3 remotes should chunk by default, because the current S3 backend fails
for files past a certain size. See [[bugs/]]

The size of chunks, as well as whether any chunking is done, should be
configurable on the fly without invalidating data already stored in the
remote. This seems important for usability (eg, so users can turn chunking
on in the webapp when configuring an existing remote).

Two concurrent uploaders of the same object to a remote should be safe,
even if they're using different chunk sizes.

# obscuring file sizes

Hiding from a remote any information about the sizes of files could be
another goal of chunking. At least two things are needed for this:

1. The filenames used on the remote don't indicate which chunks belong
   together.

2. The final short chunk needs to be padded with random data,
   so that a remote sees only encrypted files with uniform sizes
   and cannot make guesses about the kinds of data being stored.
   (A sketch of such padding follows this list.)
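
A minimal sketch of the padding in item 2, assuming a fixed, known chunk
size. A real implementation would need a cryptographically secure random
source, and some way to recover the unpadded length after decryption:

    import qualified Data.ByteString as B
    import Control.Monad (replicateM)
    import Data.Word (Word8)
    import System.Random (randomIO)

    -- Pad a final short chunk up to the uniform chunk size with random
    -- bytes, so the remote only ever sees files of one size.
    padChunk :: Int -> B.ByteString -> IO B.ByteString
    padChunk chunksize b
        | B.length b >= chunksize = return b
        | otherwise = do
            pad <- replicateM (chunksize - B.length b) (randomIO :: IO Word8)
            return (b `B.append` B.pack pad)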

Note that encrypting the whole file and then chunking and padding it is not
good, because the remote can probably examine files and tell when a gpg
stream has been cut into pieces, even without the key (have not verified
this, but it seems likely; certainly gpg magic numbers can identify gpg
encrypted files, so a file that's encrypted but lacks the magic is not the
first chunk..).

Note that padding cannot completely hide all information from an attacker
who is logging puts or gets. An attacker could, for example, look at the
times of puts, and guess at when git-annex has moved on to
encrypting/decrypting the next object, and so guess at the approximate
sizes of objects. (Concurrent uploads/downloads or random delays could be
added to prevent these kinds of attacks.)

And, obviously, if someone stores 10 tb of data in a remote, they probably
have around 10 tb of files, so it's probably not a collection of recipes..

Given its inefficiencies, and that it does not fully obscure file sizes,
padding may not be worth adding.

# design 1

Add an optional chunk field to Key. It is only present for chunks
2 and above. Ie, SHA256-s12345--xxxxxxx is the first chunk (or whole
object), while SHA256-s12345-c2--xxxxxxx is the second chunk.
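
For illustration, a simplified sketch of such a Key and its serialization
(the real Key type in git-annex has more fields than this):

    -- Simplified; only the parts relevant to chunking are modeled.
    data Key = Key
        { keyBackend :: String         -- eg "SHA256"
        , keySize    :: Maybe Integer  -- eg 12345; Nothing if unknown
        , keyHash    :: String         -- the "xxxxxxx" part
        , keyChunk   :: Maybe Integer  -- Just n only for chunks 2 and above
        }

    key2file :: Key -> String
    key2file (Key b s h c) = b
        ++ maybe "" (\n -> "-s" ++ show n) s
        ++ maybe "" (\n -> "-c" ++ show n) c
        ++ "--" ++ h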

On an encrypted remote, Keys are generated with the chunk field, and then
HMAC encrypted.

Note that only using it for chunks 2+ means that git-annex can start by
requesting the regular key, so an observer sees the same request whether
chunked or not, and does not see, eg, a pattern of failed requests for
a non-chunked key followed by successful requests for the chunked keys.
(Both more efficient and perhaps more secure.)
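
Continuing the Key sketch above (and the ByteString import from the padding
sketch), the get side might look like this, assuming the key size is known;
retrieveChunk is a stand-in for the remote's download action:

    -- Request the regular key first, then -c2, -c3, ... until the
    -- accumulated data reaches the key's recorded size.
    getChunked :: (Key -> IO B.ByteString) -> Key -> IO B.ByteString
    getChunked retrieveChunk key = go 2 =<< retrieveChunk key
      where
        Just totalsize = keySize key  -- assumed known; see problem below
        go n sofar
            | fromIntegral (B.length sofar) >= totalsize = return sofar
            | otherwise = do
                c <- retrieveChunk (key { keyChunk = Just n })
                go (n + 1) (sofar `B.append` c)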

Problem: This makes putting chunks easy. But there is a problem when getting
an object that has been chunked. If the key size is not known, we
cannot tell when we've gotten the last chunk. (Also, we cannot strip
padding.) Note that `addurl` sometimes generates keys w/o size info
(particularly, it does so by design when using quvi).

Problem: Also, this makes `hasKey` hard to implement: How can it know if
all the chunks are present, if the key size is not known?

Problem: Also, this makes it difficult to download encrypted keys, because
we only know the decrypted size, not the encrypted size, so we can't
be sure how many chunks to get, and all chunks need to be downloaded before
we can decrypt any of them.

Problem: Does not solve concurrent uploads with different chunk sizes.

# design 2

When chunking is enabled, always put a chunk number in the Key,
along with the chunk size.
So, SHA256-s10000-c1--xxxxxxx for the first chunk, when using a chunk
size of 10000 bytes.

Before any chunks are stored, write a chunkcount file, eg
SHA256-s12345-c0--xxxxxxx. Note that this key is the same as the original
object's key, except with chunk number set to 0. This file contains both
the number of chunks, and also the chunk size used. `hasKey` downloads this
file, and then verifies that each chunk is present, looking for keys with
the expected chunk numbers and chunk size.
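
A sketch of that check, reusing the Key sketch from design 1. checkPresent
and readChunkInfo are stand-ins for remote operations, and handling of the
short final chunk's size is glossed over:

    hasKey :: (Key -> IO Bool)                -- is this key's file on the remote?
           -> (Key -> IO (Integer, Integer))  -- download and parse the c0 file
           -> Key -> IO Bool
    hasKey checkPresent readChunkInfo key = do
        let countkey = key { keyChunk = Just 0 }
        ok <- checkPresent countkey
        if not ok
            then return False
            else do
                (chunkcount, chunksize) <- readChunkInfo countkey
                and <$> mapM checkPresent
                    [ key { keySize = Just chunksize, keyChunk = Just n }
                    | n <- [1 .. chunkcount] ]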

This avoids problems with multiple writers using different chunk sizes,
since they will be uploading to different files.

Problem: In such a situation, some duplicate data might be stored, not
referenced by the last chunkcount file to be written. It would not be
dropped when the key was removed from the remote.

Note: This design lets an attacker with logs tell the (approximate) size of
objects, by finding the small files that contain a chunk count, and
correlating when that is written/read and when other files are
written/read. That could be solved by padding the chunkcount key up to the
size of the rest of the keys, but that's very inefficient; `hasKey` is not
designed to need to download large files.

# design 3

Like design 1, but add an encrypted chunk count prefix to the first object.
This needs to be done in a way that does not let an attacker tell if the
object has an encrypted chunk count prefix or not.
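
One way this might be realized (a sketch only; encrypt stands in for the
remote's usual gpg encryption). It also shows where the problem below comes
from, since the two gpg streams are simply concatenated:

    import qualified Data.ByteString.Char8 as B8

    -- The first object is a separately encrypted chunk count, followed
    -- by the encrypted first chunk. Each gpg stream starts with its own
    -- magic bytes, so the boundary is detectable.
    firstObject :: (B8.ByteString -> B8.ByteString)
                -> Integer -> B8.ByteString -> B8.ByteString
    firstObject encrypt chunkcount chunk1 =
        encrypt (B8.pack (show chunkcount)) `B8.append` encrypt chunk1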

This seems difficult; an attacker could probably tell where the first
encrypted part stops and the next encrypted part starts by looking for gpg
headers, and so tell which files are the first chunks.

Also, `hasKey` would need to download some or all of the first file.
If all, that's a lot more expensive. If only some is downloaded, an
attacker can guess that the file that was partially downloaded is the
first chunk in a series, and wait for a time when it's fully downloaded to
determine which are the other chunks.

Problem: Two uploads of the same key from repos with different chunk sizes
could lead to data loss. (Same as in design 1.)

# design 4

Instead of storing the chunk count in the special remote, store it in
the git-annex branch.

So, use key SHA256-s10000-c1--xxxxxxx for the first chunk, as in design 2.

And look at git-annex:aaa/bbb/SHA256-s12345--xxxxxxx.log.cnk to get the
chunk count and size. File format would be:

    ts uuid chunksize chunkcount
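
A sketch of one such log line and its parse (the .cnk format itself is the
hypothetical part here):

    import Text.Read (readMaybe)

    data ChunkLog = ChunkLog
        { ts         :: Double    -- timestamp
        , uuid       :: String    -- uuid of the remote
        , chunksize  :: Integer
        , chunkcount :: Integer
        }

    parseChunkLog :: String -> Maybe ChunkLog
    parseChunkLog l = case words l of
        [t, u, s, c] -> ChunkLog <$> readMaybe t <*> Just u
                                 <*> readMaybe s <*> readMaybe c
        _            -> Nothing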

Note that a given remote uuid might have multiple lines, if a key was
stored on it twice using different chunk sizes. Also note that even when
this file exists for a key, the object may be stored non-chunked on the
remote too.

`hasKey` would check if any one (chunksize, chunkcount) pair is satisfied
by the files on the remote. It would also check if the non-chunked key is
present.
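
Using the ChunkLog sketch above (checkPresent again a stand-in for the
remote's presence test), that check might be:

    -- Present if the non-chunked key is there, or if all chunks of any
    -- one logged (chunksize, chunkcount) pair are there.
    hasKey' :: (Key -> IO Bool) -> [ChunkLog] -> Key -> IO Bool
    hasKey' checkPresent logs key = do
        whole <- checkPresent key
        if whole
            then return True
            else or <$> mapM satisfied logs
      where
        satisfied l = and <$> mapM checkPresent
            [ key { keySize = Just (chunksize l), keyChunk = Just n }
            | n <- [1 .. chunkcount l] ]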

When dropping a key from the remote, drop all logged chunk sizes.
As long as the location log and the new log are committed atomically,
this guarantees that no orphaned chunks end up on a remote
(except any that might be left by interrupted uploads).
(Also drop any non-chunked key.)
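
A corresponding sketch of the drop. removeKey stands in for the remote's
remove operation; committing the two logs atomically is left to the
git-annex branch layer:

    -- Remove every chunk recorded under any logged (chunksize,
    -- chunkcount) pair, and any non-chunked copy of the key.
    dropKey :: (Key -> IO ()) -> [ChunkLog] -> Key -> IO ()
    dropKey removeKey logs key = do
        mapM_ removeKey
            [ key { keySize = Just (chunksize l), keyChunk = Just n }
            | l <- logs, n <- [1 .. chunkcount l] ]
        removeKey key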

This has the best security of the designs so far, because the special
remote doesn't know anything about chunk sizes. It uses a little more
data in the git-annex branch, although with care (using the same timestamp
as the location log), it can compress pretty well.