This was the design doc for [[encryption]] and is preserved for
the curious.

[[!toc]]

## encryption backends

It makes sense to support multiple encryption backends. So, there
should be a way to tell what backend is responsible for a given filename
in an encrypted remote. (And since special remotes can also store files
unencrypted, differentiate from those as well.)

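One way to meet that requirement is to tag each stored filename with a
prefix naming its backend. The sketch below is purely illustrative: the
`GPG-` prefix, the `BACKEND_PREFIXES` table and the `backend_for` function
are hypothetical names, not something this design specifies.

```python
from typing import Optional

# Hypothetical registry mapping a filename prefix to a backend name.
# Neither the prefix format nor the names come from the design itself.
BACKEND_PREFIXES = {
    "GPG-": "gpg",
}


def backend_for(remote_filename: str) -> Optional[str]:
    """Return the backend responsible for a filename in a remote.

    None means no known prefix matched, which this sketch treats as
    "stored unencrypted".
    """
    for prefix, backend in BACKEND_PREFIXES.items():
        if remote_filename.startswith(prefix):
            return backend
    return None
```

Adding a second backend would then only need a new entry in the table,
and files stored unencrypted are told apart by having no known prefix.
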
The rest of this page will describe a single encryption backend using GPG.
Probably only one will be needed, but who knows? Maybe that backend will
turn out badly designed, or some other encryptor will be needed. Designing
with more than one encryption backend in mind helps with future-proofing.

## encryption key management

[[!template id=note text="""
The basis of this scheme was originally developed by Lars Wirzenius et al
[for Obnam](http://braawi.org/obnam/encryption/).
"""]]

Data is encrypted by gpg, using a symmetric cipher.
The cipher is itself checked into your git repository, encrypted using one or
more gpg public keys. This scheme allows new gpg private keys to be given
access to content that has already been stored in the remote.

Different encrypted remotes need to be able to each use different ciphers.
Allowing multiple ciphers to be used within a single remote would add a lot
of complexity, so it is not planned to be supported.
Instead, if you want a new cipher, create a new S3 bucket, or whatever.
There does not seem to be much benefit to using the same cipher for
two different encrypted remotes.

So, the encrypted cipher could just be stored with the rest of a remote's
configuration in `.git-annex/remotes.log` (see [[internals]]). When
`git annex initremote` makes a remote, it can generate a random symmetric
cipher, and encrypt it with the specified gpg key. To allow another gpg
public key access, update the encrypted cipher to be encrypted to both gpg
keys.

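The key handling described above can be sketched roughly as follows. This
is a minimal illustration, not how git-annex does it: the function names
and the 256-byte cipher size are assumptions, and the gpg invocation uses
only the standard `--batch`, `--encrypt` and `--recipient` options.

```python
import secrets
from typing import List


def generate_cipher(num_bytes: int = 256) -> bytes:
    """Generate a random symmetric cipher (the size is an assumption)."""
    return secrets.token_bytes(num_bytes)


def gpg_encrypt_argv(recipients: List[str]) -> List[str]:
    """Build a gpg command line that encrypts stdin to every recipient.

    Listing several recipients yields one encrypted blob that any of the
    matching private keys can decrypt.
    """
    if not recipients:
        raise ValueError("at least one gpg key is required")
    argv = ["gpg", "--batch", "--encrypt"]
    for key_id in recipients:
        argv += ["--recipient", key_id]
    return argv
```

Feeding the cipher to that command on stdin (for example with
`subprocess.run(argv, input=cipher, capture_output=True, check=True)`)
would produce the blob to store in the remote's configuration. Granting
access to another key means decrypting the cipher with an existing key and
encrypting it again with the longer recipient list.
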
## filename enumeration

If the names of files are encrypted or securely hashed, or whatever is
chosen, this makes it harder for git-annex (let alone untrusted third parties!)
to get a list of the files that are stored on a given encrypted remote.
But, does git-annex really ever need to do such an enumeration?

Apparently not. `git annex unused --from remote` can now check for
unused data that is stored on a remote, and it does so based only on
location log data for the remote. This assumes that the location log is
kept accurately.

What about `git annex fsck --from remote`? Such a command should be able to,
for each file in the repository, contact the encrypted remote to check
if it has the file. This can be done without enumeration, although it will
mean running gpg once per file fscked, to get the encrypted filename.

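A check of this kind never needs a listing from the remote. The sketch
below shows the shape of it, under the assumption that the caller supplies
the keys the location log expects on the remote, a function mapping a key
to its stored name, and a per-name presence check. All three parameter
names are hypothetical.

```python
from typing import Callable, Iterable, List


def missing_from_remote(
    expected_keys: Iterable[str],
    encrypted_name: Callable[[str], str],
    remote_has: Callable[[str], bool],
) -> List[str]:
    """Return the keys that should be on the remote but are not found.

    Each key is mapped to its stored name and looked up individually,
    so the remote is never asked to enumerate its contents.
    """
    return [
        key for key in expected_keys
        if not remote_has(encrypted_name(key))
    ]
```

The cost is one name computation and one lookup per key, which matches the
once-per-file overhead mentioned above.
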
So, the names of the files stored in the remote should be encrypted. But, it
needs to be a repeatable encryption, so they cannot just be gpg encrypted;
that would yield a new name each time. Instead, HMAC is used. Any hash
could be used with HMAC; currently SHA1 is used.

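The repeatable naming can be illustrated with Python's standard `hmac`
module. This is a sketch only: the function name, the UTF-8 encoding of the
name and the hex output format are assumptions, and the exact form of the
names git-annex stores may differ.

```python
import hashlib
import hmac


def encrypted_name(hmac_key: bytes, name: str) -> str:
    """Derive a repeatable, keyed name using HMAC-SHA1.

    The same key and name always give the same result, so the stored
    name can be recomputed whenever a file needs to be found again.
    """
    digest = hmac.new(hmac_key, name.encode("utf-8"), hashlib.sha1)
    return digest.hexdigest()
```

Unlike gpg output, the result is deterministic, and without the HMAC key
the untrusted remote cannot confirm a guess about the original name.
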
It was suggested that it might not be wise to use the same cipher for both
gpg and HMAC. Being paranoid, it's best not to tie the security of one
to the security of the other. So, the encrypted cipher described above is
actually split in two; half is used for HMAC, and half for gpg.

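A minimal sketch of that split follows. Which half goes to which use is
not stated here, so giving the first half to HMAC is an assumption, as is
the `split_cipher` name.

```python
from typing import Tuple


def split_cipher(cipher: bytes) -> Tuple[bytes, bytes]:
    """Split a decrypted cipher into (hmac_half, gpg_half).

    Assumption: the first half is used for HMAC and the second for gpg.
    """
    if len(cipher) < 2 or len(cipher) % 2 != 0:
        raise ValueError("cipher must have an even, non-zero length")
    middle = len(cipher) // 2
    return cipher[:middle], cipher[middle:]
```
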
----

Does the HMAC cipher need to be gpg encrypted? Imagine if it were
stored in plaintext in the git repository. Anyone who can access
the git repository already knows the actual filenames, and typically also
the content hashes of annexed content. Having access to the HMAC cipher
could perhaps be said to only let them verify data that they already
know.

While this seems a pretty persuasive argument, I'm not 100% convinced, and
anyway, most times that the HMAC cipher is needed, the gpg cipher is also
needed. Keeping the HMAC cipher encrypted does slow down two things:
dropping content from encrypted remotes, and checking if encrypted remotes
really have content. If it's later determined to be safe to not encrypt the
HMAC cipher, the current design allows changing that, even for existing
remotes.

## risks

A risk of this scheme is that, once the symmetric cipher has been obtained, it
allows full access to all the encrypted content. This scheme does not allow
revoking a given gpg key access to the cipher, since anyone with such a key
could have already decrypted the cipher and stored a copy.

If git-annex stores the decrypted symmetric cipher in memory, then there
is a risk that it could be intercepted from there by an attacker. Gpg
ameliorates these types of risks by using locked memory. For git-annex, note
that an attacker with local machine access can tell at least all the
filenames and metadata of files stored in the encrypted remote anyway,
and can access whatever content is stored locally.

This design does not support obfuscating the size of files by chunking
them, as that would have added a lot of complexity, for dubious benefits.
If the untrusted party running the encrypted remote wants to know file sizes,
they could correlate chunks that are accessed together. Encrypting data
changes the original file size enough to avoid it being used as a direct
fingerprint at least.