To avoid leaking even the size of your encrypted files to cloud storage
providers, add a mode that stores fixed size chunks.

May be a useful starting point for [[deltas]].

May also allow for downloading different chunks of a file concurrently from
multiple remotes.

# currently

Currently, only the webdav and directory special remotes support chunking.

The filenames used for the chunks make it easy to see which chunks
belong together, even when encryption is used. There is also a chunkcount
file, which similarly leaks information.

It is not currently possible to enable chunking on a non-chunked remote.

Problem: Two uploads of the same key from repos with different chunk sizes
could lead to data loss. For example, suppose A is 10 mb chunksize, and B
is 20 mb, and the upload speed is the same. If B starts first, then A will
overwrite the file B is uploading for the 1st chunk. Then A uploads the
second chunk, and once A is done, B finishes the 1st chunk and uploads its
second. We now have [chunk 1 (from A), chunk 2 (from B)].

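To make the failure mode concrete, here is a small illustration (a Haskell
sketch with made-up numbers, not git-annex code), taking a 40 mb object for
concreteness: a remote left holding A's chunk 1 and B's chunk 2 is missing
bytes 10-20 mb of the object entirely.

    -- Sketch (not git-annex code): byte ranges covered by numbered chunks.
    chunkRanges :: Integer -> Integer -> [(Integer, Integer)]
    chunkRanges chunkSize totalSize =
        [ (start, min (start + chunkSize) totalSize)
        | start <- [0, chunkSize .. totalSize - 1] ]

    main :: IO ()
    main = do
        let a = chunkRanges (10 * mb) (40 * mb)  -- repo A, 10 mb chunks
            b = chunkRanges (20 * mb) (40 * mb)  -- repo B, 20 mb chunks
        print (a !! 0)  -- (0,10000000): what A's chunk 1 file contains
        print (b !! 1)  -- (20000000,40000000): what B's chunk 2 file contains
        -- Together these cover bytes 0-10 mb and 20-40 mb of the 40 mb
        -- object; bytes 10-20 mb are stored nowhere, so the object is lost.
      where
        mb = 1000000
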
# new requirements

Every special remote should support chunking. (It does not make sense
to support it for git remotes, but gcrypt remotes should support it.)

S3 remotes should chunk by default, because the current S3 backend fails
for files past a certain size. See [[bugs/Truncated_file_transferred_via_S3]].

The size of chunks, as well as whether any chunking is done, should be
configurable on the fly without invalidating data already stored in the
remote. This seems important for usability (eg, so users can turn chunking
on in the webapp when configuring an existing remote).

Two concurrent uploaders of the same object to a remote should be safe,
even if they're using different chunk sizes.

The old chunk method needs to be supported for back-compat, so
keep the chunksize= setting to enable that mode, and add a new setting
for the new mode.

# obscuring file sizes

Hiding from a remote any information about the sizes of files could be
another goal of chunking. At least two things are needed for this:

1. The filenames used on the remote don't indicate which chunks belong
   together.

2. The final short chunk needs to be padded with random data,
   so that a remote sees only encrypted files with uniform sizes
   and cannot make guesses about the kinds of data being stored.
   (See the sketch after this list.)

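For item 2, a minimal sketch of the padding step, assuming the `bytestring`
and `entropy` packages are available; a real design would also have to
record the unpadded length somewhere so the padding can be stripped again,
which is part of why padding is questioned below.

    import qualified Data.ByteString as B
    import System.Entropy (getEntropy)  -- entropy package, assumed available

    -- Sketch: pad the final short chunk with random data so that every
    -- chunk stored on the remote has exactly the configured chunk size.
    padChunk :: Int -> B.ByteString -> IO B.ByteString
    padChunk chunkSize chunk
        | B.length chunk >= chunkSize = return chunk
        | otherwise = do
            padding <- getEntropy (chunkSize - B.length chunk)
            return (chunk `B.append` padding)
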
Note that padding cannot completely hide all information from an attacker
who is logging puts or gets. An attacker could, for example, look at the
times of puts, and guess at when git-annex has moved on to
encrypting/decrypting the next object, and so guess at the approximate
sizes of objects. (Concurrent uploads/downloads or random delays could be
added to prevent these kinds of attacks.)

And, obviously, if someone stores 10 tb of data in a remote, they probably
have around 10 tb of files, so it's probably not a collection of recipes.

Given its inefficiencies and lack of fully obscuring file sizes,
padding may not be worth adding, but is considered in the designs below.

# design 1

Add an optional chunk field to Key. It is only present for chunk
2 and above. Ie, SHA256-s12345--xxxxxxx is the first chunk (or whole
object), while SHA256-s12345-c2--xxxxxxx is the second chunk.

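A minimal sketch of that naming scheme, using a simplified stand-in for
git-annex's Key type (the real type has more fields); note that the first
chunk serializes identically to the unchunked key.

    -- Simplified stand-in for git-annex's Key type, for illustration only.
    data Key = Key
        { keyBackend :: String         -- eg "SHA256"
        , keySize    :: Maybe Integer  -- object size, when known
        , keyChunk   :: Maybe Integer  -- chunk number; Nothing for chunk 1
        , keyHash    :: String
        }

    serializeKey :: Key -> String
    serializeKey k = concat
        [ keyBackend k
        , maybe "" (\s -> "-s" ++ show s) (keySize k)
        , maybe "" (\c -> "-c" ++ show c) (keyChunk k)
        , "--"
        , keyHash k
        ]

    -- serializeKey (Key "SHA256" (Just 12345) Nothing  "xxxxxxx")
    --   == "SHA256-s12345--xxxxxxx"     (chunk 1, or the whole object)
    -- serializeKey (Key "SHA256" (Just 12345) (Just 2) "xxxxxxx")
    --   == "SHA256-s12345-c2--xxxxxxx"  (chunk 2)
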
On an encrypted remote, Keys are generated with the chunk field, and then
HMAC encrypted.

Note that only using it for chunks 2+ means that git-annex can start by
requesting the regular key, so an observer sees the same request whether
chunked or not, and does not see eg, a pattern of failed requests for
a non-chunked key, followed by successful requests for the chunked keys.
(Both more efficient and perhaps more secure.)

Problem: This makes putting chunks easy. But there is a problem when getting
an object that has been chunked. If the key size is not known, we
cannot tell when we've gotten the last chunk. (Also, we cannot strip
padding.) Note that `addurl` sometimes generates keys w/o size info
(particularly, it does so by design when using quvi).

Problem: Also, this makes `hasKey` hard to implement: How can it know if
all the chunks are present, if the key size is not known?

Problem: Also, this makes it difficult to download encrypted keys, because
we only know the decrypted size, not the encrypted size, so we can't
be sure how many chunks to get, and all chunks need to be downloaded before
we can decrypt any of them. (Assuming we encrypt first; chunking first
avoids this problem.)

Problem: Does not solve concurrent uploads with different chunk sizes.

# design 2

When chunking is enabled, always put a chunk number in the Key,
along with the chunk size.
So, SHA256-s1048576-c1--xxxxxxx for the first chunk of 1 megabyte.

Before any chunks are stored, write a chunkcount file, eg
SHA256-s12345-c0--xxxxxxx. Note that this key is the same as the original
object's key, except with chunk number set to 0. This file contains both
the number of chunks, and also the chunk size used. `hasKey` downloads this
file, and then verifies that each chunk is present, looking for keys with
the expected chunk numbers and chunk size.

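A sketch of how `hasKey` could work in this design. The `contains` and
`getFile` arguments are hypothetical stand-ins for special remote
operations, and the chunkcount file is assumed to simply contain the chunk
count and chunk size as two numbers.

    import Control.Monad (filterM)

    -- Design 2 sketch of hasKey. The key is passed pre-split into its
    -- prefix (eg "SHA256-s12345") and hash (eg "xxxxxxx") to keep the
    -- string handling trivial.
    hasKey
        :: (String -> IO Bool)    -- does the remote store this file?
        -> (String -> IO String)  -- read a small file from the remote
        -> String                 -- key prefix
        -> String                 -- key hash
        -> IO Bool
    hasKey contains getFile prefix hash = do
        havecount <- contains (chunkKey 0)
        if not havecount
            then return False
            else do
                (count:_chunksize:_) <- words <$> getFile (chunkKey 0)
                missing <- filterM (fmap not . contains . chunkKey) [1 .. read count]
                return (null missing)
      where
        -- chunkKey 0 is the chunkcount file, eg "SHA256-s12345-c0--xxxxxxx"
        chunkKey :: Integer -> String
        chunkKey n = prefix ++ "-c" ++ show n ++ "--" ++ hash
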
This avoids problems with multiple writers using different chunk sizes,
since they will be uploading to different files.

Problem: In such a situation, some duplicate data might be stored, not
referenced by the last chunkcount file to be written. It would not be
dropped when the key was removed from the remote.

Note: This design lets an attacker with logs tell the (approximate) size of
objects, by finding the small files that contain a chunk count, and
correlating when that is written/read and when other files are
written/read. That could be solved by padding the chunkcount key up to the
size of the rest of the keys, but that's very inefficient; `hasKey` is not
designed to need to download large files.

# design 3

Like design 1, but add an encrypted chunk count prefix to the first object.
This needs to be done in a way that does not let an attacker tell if the
object has an encrypted chunk count prefix or not.

This seems difficult; an attacker could probably tell where the first encrypted
part stops and the next encrypted part starts by looking for gpg headers,
and so tell which files are the first chunks.

Also, `hasKey` would need to download some or all of the first file.
If all, that's a lot more expensive. If only some is downloaded, an
attacker can guess that the file that was partially downloaded is the
first chunk in a series, and wait for a time when it's fully downloaded to
determine which are the other chunks.

Problem: Two uploads of the same key from repos with different chunk sizes
could lead to data loss. (Same as in design 1.)

# design 4

Use key SHA256-s12345-S1048576-C1--xxxxxxx for the first chunk of 1 megabyte.

Note that keeping the 's'ize field unchanged is necessary because it
disambiguates eg, WORM keys. So a 'S'ize field is used to hold the chunk
size.

Instead of storing the chunk count in the special remote, store it in
the git-annex branch.

The location log does not record locations of individual chunk keys
(too space-inefficient).
Instead, look at git-annex:aaa/bbb/SHA256-s12345--xxxxxxx.log.cnk to get
the chunk count and size for a key.

Note that a given remote uuid might have multiple chunk sizes logged, if a
key was stored on it twice using different chunk sizes. Also note that even
when this file exists for a key, the object may be stored non-chunked on
the remote too.

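A sketch of reading such a log. The line format here is made up for
illustration (one entry per line: timestamp, remote uuid, chunk size, chunk
count); the design above does not pin down an exact format.

    import Data.Maybe (mapMaybe)

    type UUID = String

    -- Parse a hypothetical chunk log into (uuid, (chunk size, chunk count)).
    parseChunkLog :: String -> [(UUID, (Integer, Integer))]
    parseChunkLog = mapMaybe parseLine . lines
      where
        parseLine l = case words l of
            [_timestamp, uuid, size, count] ->
                Just (uuid, (read size, read count))
            _ -> Nothing

    -- All (chunk size, chunk count) pairs logged for one remote; the same
    -- remote can appear more than once if different chunk sizes were used.
    chunksFor :: UUID -> [(UUID, (Integer, Integer))] -> [(Integer, Integer)]
    chunksFor u = map snd . filter ((== u) . fst)
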
`hasKey` would check if any one (chunksize, chunkcount) is satisfied by
the files on the remote. It would also check if the non-chunked key is
present, as a fallback.

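A sketch of that check, using the design 4 key naming and a hypothetical
`contains` operation; the same enumeration of keys tells a drop which files
to remove.

    -- Chunk key names for one logged (chunk size, chunk count) pair,
    -- following the SHA256-s<size>-S<chunksize>-C<n>--<hash> form.
    chunkKeys :: String -> String -> (Integer, Integer) -> [String]
    chunkKeys prefix hash (chunksize, chunkcount) =
        [ prefix ++ "-S" ++ show chunksize ++ "-C" ++ show n ++ "--" ++ hash
        | n <- [1 .. chunkcount] ]

    -- hasKey succeeds if the non-chunked key is present, or if all chunks
    -- of any one logged (chunk size, chunk count) pair are present.
    hasKey
        :: (String -> IO Bool)   -- does the remote store this file?
        -> String -> String      -- key prefix (eg "SHA256-s12345") and hash
        -> [(Integer, Integer)]  -- logged (chunk size, chunk count) pairs
        -> IO Bool
    hasKey contains prefix hash logged = anyM
        ( contains (prefix ++ "--" ++ hash)  -- non-chunked fallback
        : [ allM contains (chunkKeys prefix hash l) | l <- logged ] )
      where
        allM p = fmap and . mapM p
        anyM = fmap or . sequence

    -- Dropping the key removes the same set of files: the non-chunked key
    -- plus chunkKeys for every logged (chunk size, chunk count) pair.
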
When dropping a key from the remote, drop all logged chunk sizes.
(Also drop any non-chunked key.)

As long as the location log and the chunk log are committed atomically,
this guarantees that no orphaned chunks end up on a remote
(except any that might be left by interrupted uploads).

This has the best security of the designs so far, because the special
remote doesn't know anything about chunk sizes. It uses a little more
data in the git-annex branch, although with care (using the same timestamp
as the location log), it can compress pretty well.

## chunk then encrypt

Rather than encrypting the whole object 1st and then chunking, chunk and
then encrypt.

Reasons:

1. If 2 repos are uploading the same key to a remote concurrently,
   this allows some chunks to come from one and some from another,
   and be reassembled without problems.

2. Also allows chunks of the same object to be downloaded from different
   remotes, perhaps concurrently, and again be reassembled without
   problems.

3. Prevents an attacker from re-assembling the chunked file using details
   of the gpg output, which would expose approximate
   file size even if padding is being used to obscure it.

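A sketch of the two orderings, with `encrypt` as an abstract stand-in for
the gpg step (assuming the `bytestring` package); chunking first makes each
stored chunk independently producible and decryptable.

    import qualified Data.ByteString as B

    type Encrypt = B.ByteString -> B.ByteString

    -- Split an object into fixed-size pieces (assumes a positive size).
    splitIntoChunks :: Int -> B.ByteString -> [B.ByteString]
    splitIntoChunks size b
        | B.null b  = []
        | otherwise = let (c, rest) = B.splitAt size b
                      in c : splitIntoChunks size rest

    -- Chunk, then encrypt each chunk: the ordering argued for here.
    chunkThenEncrypt :: Encrypt -> Int -> B.ByteString -> [B.ByteString]
    chunkThenEncrypt encrypt size = map encrypt . splitIntoChunks size

    -- Encrypt the whole object, then chunk the ciphertext: no chunk is
    -- usable until all of them have been downloaded and reassembled.
    encryptThenChunk :: Encrypt -> Int -> B.ByteString -> [B.ByteString]
    encryptThenChunk encrypt size = splitIntoChunks size . encrypt
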
Note that this means that the chunks won't exactly match the configured
chunk size. gpg does compression, which might make them a
lot smaller. Or gpg overhead could make them slightly larger. So `hasKey`
cannot check exact file sizes.

If padding is enabled, gpg compression should be disabled, to not leak
clues about how well the file compresses, and so what kind of file it is.

## chunk key hashing

A chunk key should hash into the same directory structure as its parent
key. This will avoid lots of extra hash directories when using chunking
with non-encrypted keys.

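A minimal sketch of the idea, with `hashDirFor` standing in for git-annex's
real hash directory function.

    -- Sketch: a chunk key's storage directory is derived from its parent
    -- key, so all chunks land beside where the unchunked object would go.
    data ChunkKey = ChunkKey
        { parentKey   :: String   -- eg "SHA256-s12345--xxxxxxx"
        , chunkNumber :: Integer
        }

    storageDir :: (String -> FilePath) -> ChunkKey -> FilePath
    storageDir hashDirFor = hashDirFor . parentKey
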
This won't happen when the key is encrypted, but that is good; hashing to the
same bucket then would allow statistical correlation.