git-annex/doc/design/encryption.mdwn

This was the design doc for [[/encryption]] and is preserved for
the curious. For an example of using git-annex with an encrypted S3 remote,
see [[tips/using_Amazon_S3]].

[[!toc]]

## encryption backends

It makes sense to support multiple encryption backends. So, there
should be a way to tell what backend is responsible for a given filename
in an encrypted remote. (And since special remotes can also store files
unencrypted, differentiate from those as well.)

The rest of this page will describe a single encryption backend using GPG.
Probably only one will be needed, but who knows? Maybe that backend will
turn out to be badly designed, or some other encryptor will be needed.
Designing with more than one encryption backend in mind helps future-proofing.

## encryption key management

[[!template id=note text="""
The basis of this scheme was originally developed by Lars Wirzenius et al
[for Obnam](http://braawi.org/obnam/encryption/).
"""]]

Data is encrypted by gpg, using a symmetric cipher.
The cipher is itself checked into your git repository, encrypted using one or
more gpg public keys. This scheme allows new gpg private keys to be given
access to content that has already been stored in the remote.

Different encrypted remotes need to be able to each use different ciphers.
Allowing multiple ciphers to be used within a single remote would add a lot
of complexity, so it is not planned to be supported.
Instead, if you want a new cipher, create a new S3 bucket, or whatever.
There does not seem to be much benefit to using the same cipher for
two different encrypted remotes.

So, the encrypted cipher could just be stored with the rest of a remote's
configuration in `remotes.log` (see [[internals]]). When
`git annex initremote` makes a remote, it can generate a random symmetric
cipher, and encrypt it with the specified gpg key. To allow another gpg
public key access, update the encrypted cipher to be encrypted to both gpg
keys.
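As a rough sketch of the steps above, here is how a random cipher might be generated and encrypted to one or more gpg keys. The function names and the 512 byte cipher size are assumptions made for illustration; they are not taken from this design or from git-annex's code. Only the standard gpg options `--batch`, `--armor`, `--encrypt` and `--recipient` are used.

```python
import secrets
import subprocess

# Assumed size for illustration only; the design does not specify one.
CIPHER_BYTES = 512


def new_cipher(nbytes: int = CIPHER_BYTES) -> bytes:
    """Generate a random symmetric cipher."""
    return secrets.token_bytes(nbytes)


def gpg_encrypt_command(recipients: list[str]) -> list[str]:
    """Build a gpg command line that encrypts stdin to every recipient."""
    if not recipients:
        raise ValueError("at least one gpg key is required")
    cmd = ["gpg", "--batch", "--armor", "--encrypt"]
    for key_id in recipients:
        cmd += ["--recipient", key_id]
    return cmd


def encrypt_cipher(cipher: bytes, recipients: list[str]) -> bytes:
    """Run gpg, feeding the cipher on stdin and reading the result from stdout."""
    result = subprocess.run(
        gpg_encrypt_command(recipients),
        input=cipher,
        capture_output=True,
        check=True,
    )
    return result.stdout
```

Giving a second key access would then amount to decrypting the stored cipher with a key that already has access, and calling `encrypt_cipher` again with both key ids in `recipients`.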

## filename enumeration

If the names of files are encrypted or securely hashed, or whatever is
chosen, this makes it harder for git-annex (let alone untrusted third parties!)
to get a list of the files that are stored on a given encrypted remote.
But, does git-annex really ever need to do such an enumeration?

Apparently not. `git annex unused --from remote` can now check for
unused data that is stored on a remote, and it does so based only on
location log data for the remote. This assumes that the location log is
kept accurately.
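The idea can be sketched as a set difference that never asks the remote for a listing. The data structures below are hypothetical stand-ins (a mapping from key to the set of repository UUIDs believed to hold it), not git-annex's actual location log format.

```python
def unused_on_remote(
    location_log: dict[str, set[str]],
    remote_uuid: str,
    referenced_keys: set[str],
) -> set[str]:
    """Return keys the location log places on the remote that nothing refers to."""
    # Keys that the log claims are present on the remote.
    on_remote = {
        key for key, uuids in location_log.items() if remote_uuid in uuids
    }
    # Anything there that no file in the repository still uses is unused.
    return on_remote - referenced_keys


# Hypothetical example data.
log = {
    "key-a": {"remote-1", "laptop"},
    "key-b": {"remote-1"},
    "key-c": {"laptop"},
}
print(sorted(unused_on_remote(log, "remote-1", {"key-a", "key-c"})))
```

If the location log is wrong about what the remote holds, this result is wrong in the same way, which is why the accuracy assumption matters.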

What about `git annex fsck --from remote`? Such a command should be able to,
for each file in the repository, contact the encrypted remote to check
if it has the file. This can be done without enumeration, although it will
mean running gpg once per file fscked, to get the encrypted filename.

So, the names of files stored in the remote should be encrypted. But, it
needs to be a repeatable encryption, so they cannot just be gpg encrypted;
that would yield a new name each time. Instead, HMAC is used. Any hash
could be used with HMAC; currently SHA1 is used.
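A minimal sketch of such a repeatable name, using Python's standard `hmac` module with SHA1. The function name and the example key string are made up for illustration, and the exact form of the names git-annex stores on a remote is not specified here.

```python
import hashlib
import hmac


def remote_name(hmac_key: bytes, annex_key: str) -> str:
    """Derive a repeatable, opaque name for a key from the secret HMAC key."""
    return hmac.new(hmac_key, annex_key.encode("utf-8"), hashlib.sha1).hexdigest()


secret = b"example secret, not a real cipher"
name = remote_name(secret, "hypothetical-annex-key")

# The same inputs always give the same name, so no enumeration is needed
# to find where a given key lives on the remote.
assert name == remote_name(secret, "hypothetical-annex-key")
```

Without the secret, the name reveals nothing useful about the original key, and a different secret yields an unrelated name.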

It was suggested that it might not be wise to use the same cipher for both
gpg and HMAC. Being paranoid, it's best not to tie the security of one
to the security of the other. So, the encrypted cipher described above is
actually split in two; half is used for HMAC, and half for gpg.
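Splitting the decrypted cipher could look something like this. The function name and the requirement of an even length are assumptions of the sketch; which half goes to which use is also an arbitrary choice here.

```python
def split_cipher(cipher: bytes) -> tuple[bytes, bytes]:
    """Split a decrypted cipher into (hmac_half, gpg_half)."""
    if len(cipher) == 0 or len(cipher) % 2 != 0:
        raise ValueError("cipher must have a non-zero, even length")
    half = len(cipher) // 2
    # First half keys the HMAC of names; second half is the gpg secret.
    return cipher[:half], cipher[half:]


hmac_half, gpg_half = split_cipher(bytes(range(8)))
```

Since the two halves share no bytes, learning one of them does not directly reveal the other.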

----

Does the HMAC cipher need to be gpg encrypted? Imagine if it were
stored in plaintext in the git repository. Anyone who can access
the git repository already knows the actual filenames, and typically also
the content hashes of annexed content. Having access to the HMAC cipher
could perhaps be said to only let them verify data that they already
know.

While this seems a pretty persuasive argument, I'm not 100% convinced, and
anyway, most times that the HMAC cipher is needed, the gpg cipher is also
needed. Keeping the HMAC cipher encrypted does slow down two things:
dropping content from encrypted remotes, and checking if encrypted remotes
really have content. If it's later determined to be safe to not encrypt the
HMAC cipher, the current design allows changing that, even for existing
remotes.

## other use of the symmetric cipher

The symmetric cipher can be used to encrypt other content than the content
sent to the remote. In particular, it may make sense to encrypt whatever
access keys are used by the special remote with the cipher, and store that
in `remotes.log`. This way anyone whose gpg key has been given access to
the cipher can get access to whatever other credentials are needed to
use the special remote.

## risks

A risk of this scheme is that, once the symmetric cipher has been obtained, it
allows full access to all the encrypted content. This scheme does not allow
revoking a given gpg key access to the cipher, since anyone with such a key
could have already decrypted the cipher and stored a copy. 

If git-annex stores the decrypted symmetric cipher in memory, then there
is a risk that it could be intercepted from there by an attacker. Gpg
ameliorates these types of risks by using locked memory. For git-annex, note
that an attacker with local machine access can tell at least all the
filenames and metadata of files stored in the encrypted remote anyway,
and can access whatever content is stored locally.

This design does not support obfuscating the size of files by chunking
them, as that would have added a lot of complexity, for dubious benefits.
If the untrusted party running the encrypted remote wants to know file sizes,
they could correlate chunks that are accessed together. Encrypting data
changes the original file size enough to avoid it being used as a direct
fingerprint at least.