124 lines
5.9 KiB
Text
124 lines
5.9 KiB
Text
|
This was the design doc for [[/encryption]] and is preserved for
|
||
|
the curious. For an example of using git-annex with an encrypted S3 remote,
|
||
|
see [[tips/using_Amazon_S3]].
|
||
|
|
||
|
[[!toc]]
|
||
|
|
||
|
## encryption backends
|
||
|
|
||
|
It makes sense to support multiple encryption backends. So, there
|
||
|
should be a way to tell what backend is responsible for a given filename
|
||
|
in an encrypted remote. (And since special remotes can also store files
|
||
|
unencrypted, differentiate from those as well.)
|
||
|
|
||
|
The rest of this page will describe a single encryption backend using GPG.
|
||
|
Probably only one will be needed, but who knows? Maybe that backend will
|
||
|
turn out badly designed, or some other encryptor needed. Designing
|
||
|
with more than one encryption backend in mind helps future-proofing.
|
||
|
|
||
|
## encryption key management
|
||
|
|
||
|
[[!template id=note text="""
|
||
|
The basis of this scheme was originally developed by Lars Wirzenius et al
|
||
|
[for Obnam](http://liw.fi/obnam/encryption/).
|
||
|
"""]]
|
||
|
|
||
|
Data is encrypted by GnuPG, using a symmetric cipher. The cipher is
|
||
|
generated by GnuPG when the special remote is created. By default the
|
||
|
best entropy pool is used, hence the generation may take a while; One
|
||
|
can use `initremote` with `highRandomQuality=false` or `--fast` options
|
||
|
to speed up things, but at the expense of using random numbers of a
|
||
|
lower quality. The generated cipher is then checked into your git
|
||
|
repository, encrypted using one or more OpenPGP public keys. This scheme
|
||
|
allows new OpenPGP private keys to be given access to content that has
|
||
|
already been stored in the remote.
|
||
|
|
||
|
Different encrypted remotes need to be able to each use different ciphers.
|
||
|
Allowing multiple ciphers to be used within a single remote would add a lot
|
||
|
of complexity, so is not planned to be supported.
|
||
|
Instead, if you want a new cipher, create a new S3 bucket, or whatever.
|
||
|
There does not seem to be much benefit to using the same cipher for
|
||
|
two different encrypted remotes.
|
||
|
|
||
|
So, the encrypted cipher could just be stored with the rest of a remote's
|
||
|
configuration in `remotes.log` (see [[internals]]). When `git
|
||
|
annex intiremote` makes a remote, it can generate a random symmetric
|
||
|
cipher, and encrypt it with the specified gpg key. To allow another gpg
|
||
|
public key access, update the encrypted cipher to be encrypted to both gpg
|
||
|
keys.
|
||
|
|
||
|
## filename enumeration
|
||
|
|
||
|
If the names of files are encrypted or securely hashed, or whatever is
|
||
|
chosen, this makes it harder for git-annex (let alone untrusted third parties!)
|
||
|
to get a list of the files that are stored on a given enrypted remote.
|
||
|
But, does git-annex really ever need to do such an enumeration?
|
||
|
|
||
|
Apparently not. `git annex unused --from remote` can now check for
|
||
|
unused data that is stored on a remote, and it does so based only on
|
||
|
location log data for the remote. This assumes that the location log is
|
||
|
kept accurately.
|
||
|
|
||
|
What about `git annex fsck --from remote`? Such a command should be able to,
|
||
|
for each file in the repository, contact the encrypted remote to check
|
||
|
if it has the file. This can be done without enumeration, although it will
|
||
|
mean running gpg once per file fscked, to get the encrypted filename.
|
||
|
|
||
|
So, the files stored in the remote should be encrypted. But, it needs to
|
||
|
be a repeatable encryption, so they cannot just be gpg encrypted, that
|
||
|
would yeild a new name each time. Instead, HMAC is used. Any hash could
|
||
|
be used with HMAC. SHA-1 is the default, but [[other_hashes|/encryption]]
|
||
|
can be chosen for new remotes.
|
||
|
|
||
|
It was suggested that it might not be wise to use the same cipher for both
|
||
|
gpg and HMAC. Being paranoid, it's best not to tie the security of one
|
||
|
to the security of the other. So, the encrypted cipher described above is
|
||
|
actually split in two; half is used for HMAC, and half for gpg.
|
||
|
|
||
|
----
|
||
|
|
||
|
Does the HMAC cipher need to be gpg encrypted? Imagine if it were
|
||
|
stored in plainext in the git repository. Anyone who can access
|
||
|
the git repository already knows the actual filenames, and typically also
|
||
|
the content hashes of annexed content. Having access to the HMAC cipher
|
||
|
could perhaps be said to only let them verify that data they already
|
||
|
know.
|
||
|
|
||
|
While this seems a pretty persuasive argument, I'm not 100% convinced, and
|
||
|
anyway, most times that the HMAC cipher is needed, the gpg cipher is also
|
||
|
needed. Keeping the HMAC cipher encrypted does slow down two things:
|
||
|
dropping content from encrypted remotes, and checking if encrypted remotes
|
||
|
really have content. If it's later determined to be safe to not encrypt the
|
||
|
HMAC cipher, the current design allows changing that, even for existing
|
||
|
remotes.
|
||
|
|
||
|
## other use of the symmetric cipher
|
||
|
|
||
|
The symmetric cipher can be used to encrypt other content than the content
|
||
|
sent to the remote. In particular, it may make sense to encrypt whatever
|
||
|
access keys are used by the special remote with the cipher, and store that
|
||
|
in remotes.log. This way anyone whose gpg key has been given access to
|
||
|
the cipher can get access to whatever other credentials are needed to
|
||
|
use the special remote.
|
||
|
|
||
|
## risks
|
||
|
|
||
|
A risk of this scheme is that, once the symmetric cipher has been obtained, it
|
||
|
allows full access to all the encrypted content. This scheme does not allow
|
||
|
revoking a given gpg key access to the cipher, since anyone with such a key
|
||
|
could have already decrypted the cipher and stored a copy.
|
||
|
|
||
|
If git-annex stores the decrypted symmetric cipher in memory, then there
|
||
|
is a risk that it could be intercepted from there by an attacker. Gpg
|
||
|
amelorates these type of risks by using locked memory. For git-annex, note
|
||
|
that an attacker with local machine access can tell at least all the
|
||
|
filenames and metadata of files stored in the encrypted remote anyway,
|
||
|
and can access whatever content is stored locally.
|
||
|
|
||
|
This design does not support obfuscating the size of files by chunking
|
||
|
them, as that would have added a lot of complexity, for dubious benefits.
|
||
|
If the untrusted party running the encrypted remote wants to know file sizes,
|
||
|
they could correlate chunks that are accessed together. Encrypting data
|
||
|
changes the original file size enough to avoid it being used as a direct
|
||
|
fingerprint at least.
|