To avoid leaking even the size of your encrypted files to cloud storage
providers, add a mode that stores fixed size chunks.

May be a useful starting point for [[deltas]].

May also allow for downloading different chunks of a file concurrently from
multiple remotes.

# currently

Currently, only the webdav and directory special remotes support chunking.

The filenames used for the chunks make it easy to see which chunks
belong together, even when encryption is used. There is also a chunkcount
file, which similarly leaks information.

It is not currently possible to enable chunking on a non-chunked remote.

Problem: Two uploads of the same key from repos with different chunk sizes
could lead to data loss. For example, suppose A is 10 mb chunksize, and B
is 20 mb, and the upload speed is the same. If B starts first, then A will
overwrite the file B is uploading for the 1st chunk. Then A uploads the
second chunk, and once A is done, B finishes the 1st chunk and uploads its
second. We now have [chunk 1 (from A), chunk 2 (from B)].

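To make the failure mode concrete, here is a small illustration (a Haskell
sketch with made-up numbers, not git-annex code), taking a 40 mb object for
concreteness: a remote left holding A's chunk 1 and B's chunk 2 is missing
bytes 10-20 mb of the object entirely.

    -- Sketch (not git-annex code): byte ranges covered by numbered chunks.
    chunkRanges :: Integer -> Integer -> [(Integer, Integer)]
    chunkRanges chunkSize totalSize =
        [ (start, min (start + chunkSize) totalSize)
        | start <- [0, chunkSize .. totalSize - 1] ]

    main :: IO ()
    main = do
        let a = chunkRanges (10 * mb) (40 * mb)  -- repo A, 10 mb chunks
            b = chunkRanges (20 * mb) (40 * mb)  -- repo B, 20 mb chunks
        print (a !! 0)  -- (0,10000000): what A's chunk 1 file contains
        print (b !! 1)  -- (20000000,40000000): what B's chunk 2 file contains
        -- Together these cover bytes 0-10 mb and 20-40 mb of the 40 mb
        -- object; bytes 10-20 mb are stored nowhere, so the object is lost.
      where
        mb = 1000000
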
# new requirements

Every special remote should support chunking. (It does not make sense
to support it for git remotes, but gcrypt remotes should support it.)

S3 remotes should chunk by default, because the current S3 backend fails
for files past a certain size. See [[bugs/Truncated_file_transferred_via_S3]].

The size of chunks, as well as whether any chunking is done, should be
configurable on the fly without invalidating data already stored in the
remote. This seems important for usability (eg, so users can turn chunking
on in the webapp when configuring an existing remote).

Two concurrent uploaders of the same object to a remote should be safe,
even if they're using different chunk sizes.

The old chunk method needs to be supported for back-compat, so
keep the chunksize= setting to enable that mode, and add a new setting
for the new mode.

# obscuring file sizes

Hiding from a remote any information about the sizes of files could be
another goal of chunking. At least two things are needed for this:

1. The filenames used on the remote don't indicate which chunks belong
   together.

2. The final short chunk needs to be padded with random data,
   so that a remote sees only encrypted files with uniform sizes
   and cannot make guesses about the kinds of data being stored.
   (See the sketch after this list.)

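For item 2, a minimal sketch of the padding step, assuming the `bytestring`
and `entropy` packages are available; a real design would also have to
record the unpadded length somewhere so the padding can be stripped again,
which is part of why padding is questioned below.

    import qualified Data.ByteString as B
    import System.Entropy (getEntropy)  -- entropy package, assumed available

    -- Sketch: pad the final short chunk with random data so that every
    -- chunk stored on the remote has exactly the configured chunk size.
    padChunk :: Int -> B.ByteString -> IO B.ByteString
    padChunk chunkSize chunk
        | B.length chunk >= chunkSize = return chunk
        | otherwise = do
            padding <- getEntropy (chunkSize - B.length chunk)
            return (chunk `B.append` padding)
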
Note that padding cannot completely hide all information from an attacker
who is logging puts or gets. An attacker could, for example, look at the
times of puts, and guess at when git-annex has moved on to
encrypting/decrypting the next object, and so guess at the approximate
sizes of objects. (Concurrent uploads/downloads or random delays could be
added to prevent these kinds of attacks.)

And, obviously, if someone stores 10 tb of data in a remote, they probably
have around 10 tb of files, so it's probably not a collection of recipes.

Given its inefficiencies and lack of fully obscuring file sizes,
padding may not be worth adding, but is considered in the designs below.

# design 1

Add an optional chunk field to Key. It is only present for chunk
2 and above. Ie, SHA256-s12345--xxxxxxx is the first chunk (or whole
object), while SHA256-s12345-c2--xxxxxxx is the second chunk.

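A minimal sketch of that naming scheme, using a simplified stand-in for
git-annex's Key type (the real type has more fields); note that the first
chunk serializes identically to the unchunked key.

    -- Simplified stand-in for git-annex's Key type, for illustration only.
    data Key = Key
        { keyBackend :: String         -- eg "SHA256"
        , keySize    :: Maybe Integer  -- object size, when known
        , keyChunk   :: Maybe Integer  -- chunk number; Nothing for chunk 1
        , keyHash    :: String
        }

    serializeKey :: Key -> String
    serializeKey k = concat
        [ keyBackend k
        , maybe "" (\s -> "-s" ++ show s) (keySize k)
        , maybe "" (\c -> "-c" ++ show c) (keyChunk k)
        , "--"
        , keyHash k
        ]

    -- serializeKey (Key "SHA256" (Just 12345) Nothing  "xxxxxxx")
    --   == "SHA256-s12345--xxxxxxx"     (chunk 1, or the whole object)
    -- serializeKey (Key "SHA256" (Just 12345) (Just 2) "xxxxxxx")
    --   == "SHA256-s12345-c2--xxxxxxx"  (chunk 2)
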
On an encrypted remote, Keys are generated with the chunk field, and then
HMAC encrypted.

Note that only using it for chunks 2+ means that git-annex can start by
requesting the regular key, so an observer sees the same request whether
chunked or not, and does not see eg, a pattern of failed requests for
a non-chunked key, followed by successful requests for the chunked keys.
(Both more efficient and perhaps more secure.)

Problem: This makes putting chunks easy. But there is a problem when getting
an object that has been chunked. If the key size is not known, we
cannot tell when we've gotten the last chunk. (Also, we cannot strip
padding.) Note that `addurl` sometimes generates keys w/o size info
(particularly, it does so by design when using quvi).

Problem: Also, this makes `hasKey` hard to implement: How can it know if
all the chunks are present, if the key size is not known?

Problem: Also, this makes it difficult to download encrypted keys, because
we only know the decrypted size, not the encrypted size, so we can't
be sure how many chunks to get, and all chunks need to be downloaded before
we can decrypt any of them. (Assuming we encrypt first; chunking first
avoids this problem.)

Problem: Does not solve concurrent uploads with different chunk sizes.

# design 2

When chunking is enabled, always put a chunk number in the Key,
along with the chunk size.
So, SHA256-s1048576-c1--xxxxxxx for the first chunk of 1 megabyte.

Before any chunks are stored, write a chunkcount file, eg
SHA256-s12345-c0--xxxxxxx. Note that this key is the same as the original
object's key, except with chunk number set to 0. This file contains both
the number of chunks, and also the chunk size used. `hasKey` downloads this
file, and then verifies that each chunk is present, looking for keys with
the expected chunk numbers and chunk size.

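A sketch of how `hasKey` could work in this design. The `contains` and
`getFile` arguments are hypothetical stand-ins for special remote
operations, and the chunkcount file is assumed to simply contain the chunk
count and chunk size as two numbers.

    import Control.Monad (filterM)

    -- Design 2 sketch of hasKey. The key is passed pre-split into its
    -- prefix (eg "SHA256-s12345") and hash (eg "xxxxxxx") to keep the
    -- string handling trivial.
    hasKey
        :: (String -> IO Bool)    -- does the remote store this file?
        -> (String -> IO String)  -- read a small file from the remote
        -> String                 -- key prefix
        -> String                 -- key hash
        -> IO Bool
    hasKey contains getFile prefix hash = do
        havecount <- contains (chunkKey 0)
        if not havecount
            then return False
            else do
                (count:_chunksize:_) <- words <$> getFile (chunkKey 0)
                missing <- filterM (fmap not . contains . chunkKey) [1 .. read count]
                return (null missing)
      where
        -- chunkKey 0 is the chunkcount file, eg "SHA256-s12345-c0--xxxxxxx"
        chunkKey :: Integer -> String
        chunkKey n = prefix ++ "-c" ++ show n ++ "--" ++ hash
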
This avoids problems with multiple writers using different chunk sizes,
since they will be uploading to different files.

Problem: In such a situation, some duplicate data might be stored, not
referenced by the last chunkcount file to be written. It would not be
dropped when the key was removed from the remote.

Note: This design lets an attacker with logs tell the (approximate) size of
objects, by finding the small files that contain a chunk count, and
correlating when that is written/read and when other files are
written/read. That could be solved by padding the chunkcount key up to the
size of the rest of the keys, but that's very inefficient; `hasKey` is not
designed to need to download large files.

# design 3

Like design 1, but add an encrypted chunk count prefix to the first object.
This needs to be done in a way that does not let an attacker tell if the
object has an encrypted chunk count prefix or not.

This seems difficult; an attacker could probably tell where the first encrypted
part stops and the next encrypted part starts by looking for gpg headers,
and so tell which files are the first chunks.

Also, `hasKey` would need to download some or all of the first file.
If all, that's a lot more expensive. If only some is downloaded, an
attacker can guess that the file that was partially downloaded is the
first chunk in a series, and wait for a time when it's fully downloaded to
determine which are the other chunks.

Problem: Two uploads of the same key from repos with different chunk sizes
could lead to data loss. (Same as in design 1.)

# design 4

Use key SHA256-s12345-S1048576-C1--xxxxxxx for the first chunk of 1 megabyte.

Note that keeping the 's'ize field unchanged is necessary because it
disambiguates eg, WORM keys. So a 'S'ize field is used to hold the chunk
size.

Instead of storing the chunk count in the special remote, store it in
the git-annex branch.

The location log does not record locations of individual chunk keys
(too space-inefficient).
Instead, look at git-annex:aaa/bbb/SHA256-s12345--xxxxxxx.log.cnk to get
the chunk count and size for a key.

Note that a given remote uuid might have multiple chunk sizes logged, if a
key was stored on it twice using different chunk sizes. Also note that even
when this file exists for a key, the object may be stored non-chunked on
the remote too.

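A sketch of reading such a log. The line format here is made up for
illustration (one entry per line: timestamp, remote uuid, chunk size, chunk
count); the design above does not pin down an exact format.

    import Data.Maybe (mapMaybe)

    type UUID = String

    -- Parse a hypothetical chunk log into (uuid, (chunk size, chunk count)).
    parseChunkLog :: String -> [(UUID, (Integer, Integer))]
    parseChunkLog = mapMaybe parseLine . lines
      where
        parseLine l = case words l of
            [_timestamp, uuid, size, count] ->
                Just (uuid, (read size, read count))
            _ -> Nothing

    -- All (chunk size, chunk count) pairs logged for one remote; the same
    -- remote can appear more than once if different chunk sizes were used.
    chunksFor :: UUID -> [(UUID, (Integer, Integer))] -> [(Integer, Integer)]
    chunksFor u = map snd . filter ((== u) . fst)
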
`hasKey` would check if any one (chunksize, chunkcount) is satisfied by
the files on the remote. It would also check if the non-chunked key is
present, as a fallback.

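A sketch of that check, using the design 4 key naming and a hypothetical
`contains` operation; the same enumeration of keys tells a drop which files
to remove.

    -- Chunk key names for one logged (chunk size, chunk count) pair,
    -- following the SHA256-s<size>-S<chunksize>-C<n>--<hash> form.
    chunkKeys :: String -> String -> (Integer, Integer) -> [String]
    chunkKeys prefix hash (chunksize, chunkcount) =
        [ prefix ++ "-S" ++ show chunksize ++ "-C" ++ show n ++ "--" ++ hash
        | n <- [1 .. chunkcount] ]

    -- hasKey succeeds if the non-chunked key is present, or if all chunks
    -- of any one logged (chunk size, chunk count) pair are present.
    hasKey
        :: (String -> IO Bool)   -- does the remote store this file?
        -> String -> String      -- key prefix (eg "SHA256-s12345") and hash
        -> [(Integer, Integer)]  -- logged (chunk size, chunk count) pairs
        -> IO Bool
    hasKey contains prefix hash logged = anyM
        ( contains (prefix ++ "--" ++ hash)  -- non-chunked fallback
        : [ allM contains (chunkKeys prefix hash l) | l <- logged ] )
      where
        allM p = fmap and . mapM p
        anyM = fmap or . sequence

    -- Dropping the key removes the same set of files: the non-chunked key
    -- plus chunkKeys for every logged (chunk size, chunk count) pair.
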
When dropping a key from the remote, drop all logged chunk sizes.
(Also drop any non-chunked key.)

As long as the location log and the chunk log are committed atomically,
this guarantees that no orphaned chunks end up on a remote
(except any that might be left by interrupted uploads).

This has the best security of the designs so far, because the special
remote doesn't know anything about chunk sizes. It uses a little more
data in the git-annex branch, although with care (using the same timestamp
as the location log), it can compress pretty well.

## chunk then encrypt

Rather than encrypting the whole object 1st and then chunking, chunk and
then encrypt.

Reasons:

1. If 2 repos are uploading the same key to a remote concurrently,
   this allows some chunks to come from one and some from another,
   and be reassembled without problems.

2. Also allows chunks of the same object to be downloaded from different
   remotes, perhaps concurrently, and again be reassembled without
   problems.

3. Prevents an attacker from re-assembling the chunked file using details
   of the gpg output, which would expose approximate
   file size even if padding is being used to obscure it.

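A sketch of the two orderings, with `encrypt` as an abstract stand-in for
the gpg step (assuming the `bytestring` package); chunking first makes each
stored chunk independently producible and decryptable.

    import qualified Data.ByteString as B

    type Encrypt = B.ByteString -> B.ByteString

    -- Split an object into fixed-size pieces (assumes a positive size).
    splitIntoChunks :: Int -> B.ByteString -> [B.ByteString]
    splitIntoChunks size b
        | B.null b  = []
        | otherwise = let (c, rest) = B.splitAt size b
                      in c : splitIntoChunks size rest

    -- Chunk, then encrypt each chunk: the ordering argued for here.
    chunkThenEncrypt :: Encrypt -> Int -> B.ByteString -> [B.ByteString]
    chunkThenEncrypt encrypt size = map encrypt . splitIntoChunks size

    -- Encrypt the whole object, then chunk the ciphertext: no chunk is
    -- usable until all of them have been downloaded and reassembled.
    encryptThenChunk :: Encrypt -> Int -> B.ByteString -> [B.ByteString]
    encryptThenChunk encrypt size = splitIntoChunks size . encrypt
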
Note that this means that the chunks won't exactly match the configured
chunk size. gpg does compression, which might make them a
lot smaller. Or gpg overhead could make them slightly larger. So `hasKey`
cannot check exact file sizes.

If padding is enabled, gpg compression should be disabled, to not leak
clues about how well the file compresses, and so what kind of file it is.

## chunk key hashing

A chunk key should hash into the same directory structure as its parent
key. This will avoid lots of extra hash directories when using chunking
with non-encrypted keys.

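A minimal sketch of the idea, with `hashDirFor` standing in for git-annex's
real hash directory function.

    -- Sketch: a chunk key's storage directory is derived from its parent
    -- key, so all chunks land beside where the unchunked object would go.
    data ChunkKey = ChunkKey
        { parentKey   :: String   -- eg "SHA256-s12345--xxxxxxx"
        , chunkNumber :: Integer
        }

    storageDir :: (String -> FilePath) -> ChunkKey -> FilePath
    storageDir hashDirFor = hashDirFor . parentKey
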
This won't happen when the key is encrypted, but that is good; hashing to the
same bucket then would allow statistical correlation.