117 lines
		
	
	
	
		
			5.5 KiB
			
		
	
	
	
		
			Markdown
		
	
	
	
	
	
			
		
		
	
	
			117 lines
		
	
	
	
		
			5.5 KiB
			
		
	
	
	
		
			Markdown
		
	
	
	
	
	
This was the design doc for [[/encryption]] and is preserved for
 | 
						|
the curious. For an example of using git-annex with an encrypted S3 remote,
 | 
						|
see [[tips/using_Amazon_S3]].
 | 
						|
 | 
						|
[[!toc]]
 | 
						|
 | 
						|
## encryption backends
 | 
						|
 | 
						|
It makes sense to support multiple encryption backends. So, there
 | 
						|
should be a way to tell what backend is responsible for a given filename
 | 
						|
in an encrypted remote. (And since special remotes can also store files
 | 
						|
unencrypted, differentiate from those as well.)
 | 
						|
 | 
						|
The rest of this page will describe a single encryption backend using GPG.
 | 
						|
Probably only one will be needed, but who knows? Maybe that backend will
 | 
						|
turn out badly designed, or some other encryptor needed. Designing
 | 
						|
with more than one encryption backend in mind helps future-proofing.
 | 
						|
 | 
						|
## encryption key management
 | 
						|
 | 
						|
[[!template id=note text="""
 | 
						|
The basis of this scheme was originally developed by Lars Wirzenius et al
 | 
						|
[for Obnam](http://liw.fi/obnam/encryption/).
 | 
						|
"""]]
 | 
						|
 | 
						|
Data is encrypted by gpg, using a symmetric cipher.
 | 
						|
The cipher is itself checked into your git repository, encrypted using one or
 | 
						|
more gpg public keys. This scheme allows new gpg private keys to be given
 | 
						|
access to content that has already been stored in the remote.
 | 
						|
 | 
						|
Different encrypted remotes need to be able to each use different ciphers.
 | 
						|
Allowing multiple ciphers to be used within a single remote would add a lot
 | 
						|
of complexity, so is not planned to be supported.
 | 
						|
Instead, if you want a new cipher, create a new S3 bucket, or whatever.
 | 
						|
There does not seem to be much benefit to using the same cipher for
 | 
						|
two different encrypted remotes.
 | 
						|
 | 
						|
So, the encrypted cipher could just be stored with the rest of a remote's
 | 
						|
configuration in `remotes.log` (see [[internals]]). When `git
 | 
						|
annex intiremote` makes a remote, it can generate a random symmetric
 | 
						|
cipher, and encrypt it with the specified gpg key. To allow another gpg
 | 
						|
public key access, update the encrypted cipher to be encrypted to both gpg
 | 
						|
keys.
 | 
						|
 | 
						|
## filename enumeration
 | 
						|
 | 
						|
If the names of files are encrypted or securely hashed, or whatever is
 | 
						|
chosen, this makes it harder for git-annex (let alone untrusted third parties!)
 | 
						|
to get a list of the files that are stored on a given enrypted remote.
 | 
						|
But, does git-annex really ever need to do such an enumeration?
 | 
						|
 | 
						|
Apparently not. `git annex unused --from remote` can now check for
 | 
						|
unused data that is stored on a remote, and it does so based only on
 | 
						|
location log data for the remote. This assumes that the location log is
 | 
						|
kept accurately.
 | 
						|
 | 
						|
What about `git annex fsck --from remote`? Such a command should be able to,
 | 
						|
for each file in the repository, contact the encrypted remote to check
 | 
						|
if it has the file. This can be done without enumeration, although it will
 | 
						|
mean running gpg once per file fscked, to get the encrypted filename.
 | 
						|
 | 
						|
So, the files stored in the remote should be encrypted. But, it needs
 | 
						|
to be a repeatable encryption, so they cannot just be gpg encrypted,
 | 
						|
that would yeild a new name each time. Instead, HMAC is used. Any hash
 | 
						|
could be used with HMAC; currently SHA1 is used.
 | 
						|
 | 
						|
It was suggested that it might not be wise to use the same cipher for both
 | 
						|
gpg and HMAC. Being paranoid, it's best not to tie the security of one
 | 
						|
to the security of the other. So, the encrypted cipher described above is
 | 
						|
actually split in two; half is used for HMAC, and half for gpg.
 | 
						|
 | 
						|
----
 | 
						|
 | 
						|
Does the HMAC cipher need to be gpg encrypted? Imagine if it were
 | 
						|
stored in plainext in the git repository. Anyone who can access
 | 
						|
the git repository already knows the actual filenames, and typically also
 | 
						|
the content hashes of annexed content. Having access to the HMAC cipher
 | 
						|
could perhaps be said to only let them verify that data they already
 | 
						|
know.
 | 
						|
 | 
						|
While this seems a pretty persuasive argument, I'm not 100% convinced, and
 | 
						|
anyway, most times that the HMAC cipher is needed, the gpg cipher is also
 | 
						|
needed. Keeping the HMAC cipher encrypted does slow down two things:
 | 
						|
dropping content from encrypted remotes, and checking if encrypted remotes
 | 
						|
really have content. If it's later determined to be safe to not encrypt the
 | 
						|
HMAC cipher, the current design allows changing that, even for existing
 | 
						|
remotes.
 | 
						|
 | 
						|
## other use of the symmetric cipher
 | 
						|
 | 
						|
The symmetric cipher can be used to encrypt other content than the content
 | 
						|
sent to the remote. In particular, it may make sense to encrypt whatever
 | 
						|
access keys are used by the special remote with the cipher, and store that
 | 
						|
in remotes.log. This way anyone whose gpg key has been given access to 
 | 
						|
the cipher can get access to whatever other credentials are needed to
 | 
						|
use the special remote.
 | 
						|
 | 
						|
## risks
 | 
						|
 | 
						|
A risk of this scheme is that, once the symmetric cipher has been obtained, it
 | 
						|
allows full access to all the encrypted content. This scheme does not allow
 | 
						|
revoking a given gpg key access to the cipher, since anyone with such a key
 | 
						|
could have already decrypted the cipher and stored a copy. 
 | 
						|
 | 
						|
If git-annex stores the decrypted symmetric cipher in memory, then there
 | 
						|
is a risk that it could be intercepted from there by an attacker. Gpg
 | 
						|
amelorates these type of risks by using locked memory. For git-annex, note
 | 
						|
that an attacker with local machine access can tell at least all the
 | 
						|
filenames and metadata of files stored in the encrypted remote anyway,
 | 
						|
and can access whatever content is stored locally.
 | 
						|
 | 
						|
This design does not support obfuscating the size of files by chunking
 | 
						|
them, as that would have added a lot of complexity, for dubious benefits.
 | 
						|
If the untrusted party running the encrypted remote wants to know file sizes,
 | 
						|
they could correlate chunks that are accessed together. Encrypting data
 | 
						|
changes the original file size enough to avoid it being used as a direct
 | 
						|
fingerprint at least.
 |