To avoid leaking even the size of your encrypted files to cloud storage
providers, add a mode that stores fixed size chunks.

May be a useful starting point for [[deltas]].

May also allow for downloading different chunks of a file concurrently from
multiple remotes.

Can also allow resuming of interrupted uploads and downloads.

# legacy chunking

Supported by only the webdav and directory special remotes.

The filenames used for the chunks make it easy to see which chunks
belong together, even when encryption is used. There is also a chunkcount
file, which similarly leaks information.

It is not possible to enable chunking on a non-chunked remote.

Problem: Two uploads of the same key from repos with different chunk sizes
could lead to data loss. For example, suppose A has a 10 mb chunksize, and B
has 20 mb, and the upload speed is the same. If B starts first, then when A
starts it overwrites the file B is uploading for the 1st chunk. Then A
uploads the second chunk, and once A is done, B finishes the 1st chunk and
uploads its second. We now have [chunk 1 (from A), chunk 2 (from B)].

# new requirements

Every special remote should support chunking. (It does not make sense
to support it for git remotes, but gcrypt remotes should support it.)

S3 remotes should chunk by default, because the current S3 backend fails
for files past a certain size. See [[bugs/Truncated_file_transferred_via_S3]].

The size of chunks, as well as whether any chunking is done, should be
configurable on the fly without invalidating data already stored in the
remote. This seems important for usability (eg, so users can turn chunking
on in the webapp when configuring an existing remote).

Two concurrent uploaders of the same object to a remote should be safe,
even if they're using different chunk sizes.

The legacy chunk method needs to be supported for back-compat, so
keep the chunksize= setting to enable that mode, and add a new chunk=
setting for the new mode.
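
For example, switching an existing remote over to the new mode might look
like this (a hypothetical invocation; the exact syntax of the chunk= value
is not settled here):

    git annex enableremote myremote chunk=1MiB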

# obscuring file sizes

Hiding any information about the sizes of files from a remote could be
another goal of chunking. At least two things are needed for this:

1. The filenames used on the remote don't indicate which chunks belong
   together.

2. The final short chunk needs to be padded with random data,
   so that a remote sees only encrypted files with uniform sizes
   and cannot make guesses about the kinds of data being stored.
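
A rough sketch of the padding step, assuming chunk-then-encrypt (see below)
so that padding is applied to the plaintext chunk before encryption.
`padChunk` is an illustrative helper, not git-annex's implementation, and a
real implementation would need a cryptographically secure RNG:

    import qualified Data.ByteString as B
    import System.Random (randomIO)

    -- Pad the final short chunk up to the configured chunk size with
    -- random bytes, so the remote only ever sees uniformly sized files.
    -- The object's real size (recorded in its key) is what allows the
    -- padding to be stripped back off after download.
    padChunk :: Int -> B.ByteString -> IO B.ByteString
    padChunk chunksize b = do
        let needed = chunksize - B.length b
        padding <- B.pack <$> mapM (const randomIO) [1 .. needed]
        return (b `B.append` padding)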

Note that padding cannot completely hide all information from an attacker
who is logging puts or gets. An attacker could, for example, look at the
times of puts, and guess at when git-annex has moved on to
encrypting/decrypting the next object, and so guess at the approximate
sizes of objects. (Concurrent uploads/downloads or random delays could be
added to prevent these kinds of attacks.)

And, obviously, if someone stores 10 tb of data in a remote, they probably
have around 10 tb of files, so it's probably not a collection of recipes..

Given its inefficiencies, and the fact that it cannot fully obscure file
sizes, padding may not be worth adding, but it is considered in the designs
below.

# design 1

Add an optional chunk field to Key. It is only present for chunk
2 and above. Ie, SHA256-s12345--xxxxxxx is the first chunk (or whole
object), while SHA256-s12345-c2--xxxxxxx is the second chunk.
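
A sketch of how the chunk field might be serialized, using a simplified
stand-in for git-annex's real Key type:

    -- A simplified Key, with this design's optional chunk field added.
    data Key = Key
        { keyBackend :: String         -- eg "SHA256"
        , keySize :: Maybe Integer     -- the "s" size field
        , keyChunkNum :: Maybe Integer -- the "c" field; Nothing for chunk 1
        , keyName :: String            -- the hash part after "--"
        }

    -- serializeKey (Key "SHA256" (Just 12345) (Just 2) "xxxxxxx")
    --   == "SHA256-s12345-c2--xxxxxxx"
    serializeKey :: Key -> String
    serializeKey k = keyBackend k
        ++ maybe "" (\s -> "-s" ++ show s) (keySize k)
        ++ maybe "" (\c -> "-c" ++ show c) (keyChunkNum k)
        ++ "--" ++ keyName k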

On an encrypted remote, Keys are generated with the chunk field, and then
HMAC encrypted.

Note that only using it for chunks 2+ means that git-annex can start by
requesting the regular key, so an observer sees the same request whether
chunked or not, and does not see eg, a pattern of failed requests for
a non-chunked key, followed by successful requests for the chunked keys.
(Both more efficient and perhaps more secure.)

Problem: This makes putting chunks easy. But there is a problem when getting
an object that has been chunked. If the key size is not known, we
cannot tell when we've gotten the last chunk. (Also, we cannot strip
padding.) Note that `addurl` sometimes generates keys w/o size info
(particularly, it does so by design when using quvi).

Problem: Also, this makes `checkPresent` hard to implement: How can it know if
all the chunks are present, if the key size is not known?

Problem: Also, this makes it difficult to download encrypted keys, because
we only know the decrypted size, not the encrypted size, so we can't
be sure how many chunks to get, and all chunks need to be downloaded before
we can decrypt any of them. (Assuming we encrypt first; chunking first
avoids this problem.)

Problem: Does not solve concurrent uploads with different chunk sizes.

# design 2

When chunking is enabled, always put a chunk number in the Key,
along with the chunk size.
So, SHA256-s1048576-c1--xxxxxxx for the first chunk of 1 megabyte.

Before any chunks are stored, write a chunkcount file, eg
SHA256-s12345-c0--xxxxxxx. Note that this key is the same as the original
object's key, except with chunk number set to 0. This file contains both
the number of chunks and the chunk size used. `checkPresent` downloads this
file, and then verifies that each chunk is present, looking for keys with
the expected chunk numbers and chunk size.
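
A sketch of that `checkPresent` logic. The remote primitives and the chunk
key naming are passed in as parameters, and the chunkcount file's format is
assumed here to be "&lt;numchunks&gt; &lt;chunksize&gt;"; none of that is specified
above:

    -- Returns True only if the chunkcount file can be fetched and
    -- every chunk it describes is present on the remote.
    checkPresent
        :: IO (Maybe String)              -- fetch the chunkcount file
        -> (String -> IO Bool)            -- is this stored file present?
        -> (Integer -> Integer -> String) -- chunk name from chunk size, number
        -> IO Bool
    checkPresent fetchcount present mkChunkName = do
        v <- fetchcount
        case fmap words v of
            Just [n, sz] -> do
                oks <- mapM (present . mkChunkName (read sz)) [1 .. read n]
                return (and oks)
            _ -> return False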

This avoids problems with multiple writers using different chunk sizes,
since they will be uploading to different files.

Problem: In such a situation, some duplicate data might be stored, not
referenced by the last chunkcount file to be written. It would not be
dropped when the key was removed from the remote.

Note: This design lets an attacker with logs tell the (approximate) size of
objects, by finding the small files that contain a chunk count, and
correlating when that is written/read and when other files are
written/read. That could be solved by padding the chunkcount key up to the
size of the rest of the keys, but that's very inefficient; `checkPresent` is
not designed to need to download large files.

# design 3

Like design 1, but add an encrypted chunk count prefix to the first object.
This needs to be done in a way that does not let an attacker tell if the
object has an encrypted chunk count prefix or not.

This seems difficult; an attacker could probably tell where the first
encrypted part stops and the next encrypted part starts by looking for gpg
headers, and so tell which files are the first chunks.

Also, `checkPresent` would need to download some or all of the first file.
If all, that's a lot more expensive. If only some is downloaded, an
attacker can guess that the file that was partially downloaded is the
first chunk in a series, and wait for a time when it's fully downloaded to
determine which are the other chunks.

Problem: Two uploads of the same key from repos with different chunk sizes
could lead to data loss. (Same as in design 2.)

# design 4

Use key SHA256-s12345-S1048576-C1--xxxxxxx for the first chunk of 1 megabyte.

Note that keeping the 's'ize field unchanged is necessary because it
disambiguates eg, WORM keys. So a 'S'ize field is used to hold the chunk
size.
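
Extending the key sketch from design 1 (again a simplified stand-in, not
git-annex's actual Key type):

    -- Design 4 keeps the original "s" size and adds "S" and "C" fields.
    data Key = Key
        { keyBackend :: String
        , keySize :: Maybe Integer      -- "s", the full object's size
        , keyChunkSize :: Maybe Integer -- "S", the chunk size
        , keyChunkNum :: Maybe Integer  -- "C", the chunk number
        , keyName :: String
        }

    -- serializeKey (Key "SHA256" (Just 12345) (Just 1048576) (Just 1) "xxxxxxx")
    --   == "SHA256-s12345-S1048576-C1--xxxxxxx"
    serializeKey :: Key -> String
    serializeKey k = keyBackend k
        ++ field "s" (keySize k)
        ++ field "S" (keyChunkSize k)
        ++ field "C" (keyChunkNum k)
        ++ "--" ++ keyName k
      where
        field f = maybe "" (\v -> "-" ++ f ++ show v)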

Instead of storing the chunk count in the special remote, store it in
the git-annex branch.

The location log does not record locations of individual chunk keys
(too space-inefficient). Instead, look at a chunk log in the
git-annex branch to get the chunk count and size for a key.

`checkPresent` would check if any of the logged sets of chunks is
present on the remote. It would also check if the non-chunked key is
present, as a fallback.

When dropping a key from the remote, drop all logged chunk sizes.
(Also drop any non-chunked key.)

As long as the location log and the chunk log are committed atomically,
this guarantees that no orphaned chunks end up on a remote
(except any that might be left by interrupted uploads).

This has the best security of the designs so far, because the special
remote doesn't know anything about chunk sizes. It uses a little more
data in the git-annex branch, although with care (using the same timestamp
as the location log), it can compress pretty well.

## chunk log

Stored in the git-annex branch, this provides a mapping `Key -> [[Key]]`.

Note that a given remote uuid might have multiple sets of chunks (with
different sizes) logged, if a key was stored on it twice using different
chunk sizes. Also note that even when the log indicates a key is chunked,
the object may be stored non-chunked on the remote too.

For fixed size chunks, there's no need to store the list of chunk keys;
instead the log only records the number of chunks (needed because the size
of the parent Key may not be known), and the chunk size.

Example:

    1287290776.765152s e605dca6-446a-11e0-8b2a-002170d25c55:10240 9
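
A sketch of parsing such a line, with the field layout assumed from the
example above (timestamp, then remote uuid and chunk size, then chunk
count):

    import Text.Read (readMaybe)

    data ChunkLog = ChunkLog
        { timestamp :: Double   -- from eg "1287290776.765152s"
        , remoteUUID :: String
        , chunkSize :: Integer  -- bytes per chunk
        , chunkCount :: Integer
        }
        deriving Show

    parseChunkLog :: String -> Maybe ChunkLog
    parseChunkLog s = case words s of
        [ts, uuidfield, n] -> do
            t <- readMaybe (takeWhile (/= 's') ts)
            let (uuid, rest) = break (== ':') uuidfield
            sz <- readMaybe (drop 1 rest)
            count <- readMaybe n
            Just (ChunkLog t uuid sz count)
        _ -> Nothing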

Later, might want to support other kinds of chunks, for example ones made
using an rsync-style rolling checksum. It would probably not make sense to
store the full [Key] list for such chunks in the log. Instead, it might be
stored in a file on the remote.

To support such future developments, when updating the chunk log,
git-annex should preserve unparsable values (the part after the colon).

## chunk then encrypt

Rather than encrypting the whole object 1st and then chunking, chunk and
then encrypt. (A sketch follows the list of reasons below.)

Reasons:

1. If 2 repos are uploading the same key to a remote concurrently,
   this allows some chunks to come from one and some from another,
   and be reassembled without problems.

2. Also allows chunks of the same object to be downloaded from different
   remotes, perhaps concurrently, and again be reassembled without
   problems.

3. Prevents an attacker from re-assembling the chunked file using details
   of the gpg output, which would expose the approximate
   file size even if padding is being used to obscure it.
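
A minimal sketch of the chunk-then-encrypt pipeline, with encryption and
upload abstracted as parameters (git-annex would use gpg and a remote's
store operation here):

    import qualified Data.ByteString.Lazy as L

    -- Each plaintext chunk is encrypted independently, so chunks can
    -- also be uploaded, downloaded, and reassembled independently.
    storeChunked
        :: Integer                            -- configured chunk size
        -> (L.ByteString -> IO L.ByteString)  -- encrypt one chunk
        -> (Integer -> L.ByteString -> IO ()) -- upload chunk number n
        -> L.ByteString                       -- the object's content
        -> IO ()
    storeChunked chunksize encrypt upload = go 1
      where
        go n b
            | L.null b = return ()
            | otherwise = do
                let (chunk, rest) = L.splitAt (fromIntegral chunksize) b
                upload n =<< encrypt chunk
                go (n + 1) rest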

Note that this means that the chunks won't exactly match the configured
chunk size. gpg does compression, which might make them a
lot smaller. Or gpg overhead could make them slightly larger. So `checkPresent`
cannot check exact file sizes.

If padding is enabled, gpg compression should be disabled, to not leak
clues about how well the files compress and so what kind of files they are.

## chunk key hashing

A chunk key should hash into the same directory structure as its parent
key. This will avoid lots of extra hash directories when using chunking
with non-encrypted keys.

This won't happen when the key is encrypted, but that is good; hashing to
the same bucket then would allow statistical correlation.
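
A sketch, reusing the simplified Key type from the design 4 sketch above;
`hashDir` stands in for git-annex's hash directory function:

    -- A chunk key's hash directory is computed from its parent key,
    -- ie the same key with the chunk fields stripped.
    parentKey :: Key -> Key
    parentKey k = k { keyChunkSize = Nothing, keyChunkNum = Nothing }

    chunkKeyHashDir :: (Key -> FilePath) -> Key -> FilePath
    chunkKeyHashDir hashDir = hashDir . parentKey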

## resuming interrupted transfers

Both interrupted downloads and interrupted uploads can be resumed.

Downloads: If the tmp file for a key exists, round its size down to the
chunk size, and skip forward to the next needed chunk. Easy.

Uploads: Check if the 1st chunk is present. If so, check the second chunk,
etc. Once the first missing chunk is found, start uploading from there.
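
A sketch of both resume calculations (the presence check is a stand-in for
a real remote operation):

    -- Download resume: round the tmp file's size down to a chunk
    -- boundary; any partial trailing chunk is re-fetched.
    resumeOffset :: Integer -> Integer -> Integer
    resumeOffset tmpsize chunksize = (tmpsize `div` chunksize) * chunksize

    -- Upload resume: probe chunks in order; the first missing chunk
    -- is where uploading resumes.
    firstMissingChunk :: (Integer -> IO Bool) -> Integer -> IO Integer
    firstMissingChunk chunkPresent numchunks = go 1
      where
        go n
            | n > numchunks = return n -- all chunks already present
            | otherwise = do
                present <- chunkPresent n
                if present then go (n + 1) else return n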

That adds one extra checkPresent call per upload. Probably a win in most
cases. Can be improved by making special remotes open a persistent
connection that is used for transferring all chunks, as well as for
checkPresent checks.

Note that this is safe to do only as long as the Key being transferred
cannot possibly have 2 different contents in different repos. Notably, that
is not necessarily the case for the URL keys generated for quvi.

Both **done**.

## parallel

If 2 remotes both support chunking, an upload could send different chunks
to them in parallel. However, the chunk log does not currently allow
representing the state where some chunks are on one remote and others on
another remote.

Parallel downloading of chunks from different remotes is a bit more doable.