This was the design doc for [[/encryption]] and is preserved for
the curious. For an example of using git-annex with an encrypted S3 remote,
see [[tips/using_Amazon_S3]].

[[!toc]]

## encryption key management

[[!template id=note text="""
The basis of this scheme was originally developed by Lars Wirzenius et al
[for Obnam](http://liw.fi/obnam/encryption/).
"""]]

Data is encrypted by GnuPG, using a symmetric cipher. The cipher is
generated by GnuPG when the special remote is created. By default the
best entropy pool is used, so the generation may take a while. One
can use `initremote` with the `--fast` option
to speed things up, but at the expense of using random numbers of a
lower quality. The generated cipher is then checked into your git
repository, encrypted using one or more OpenPGP public keys. This scheme
allows new OpenPGP private keys to be given access to content that has
already been stored in the remote.

Different encrypted remotes need to be able to each use different ciphers.
Allowing multiple ciphers to be used within a single remote would add a lot
of complexity, so it is not supported.
Instead, if you want a new cipher, create a new S3 bucket, or whatever.
There does not seem to be much benefit to using the same cipher for
two different encrypted remotes.

So, the encrypted cipher is just stored with the rest of a remote's
configuration in `remotes.log` (see [[internals]]). When
`git annex initremote` makes a remote, it generates a random symmetric
cipher, and encrypts it with the specified gpg key. To allow another gpg
public key access, update the encrypted cipher to be encrypted to both gpg
keys.

Note that there's a shared encryption mode where the cipher is not
encrypted. When this mode is used, any clone of the git repository
can decrypt files stored in its special remote.

## filename enumeration

If the names of files are encrypted or securely hashed, or whatever is
chosen, this makes it harder for git-annex (let alone untrusted third parties!)
to get a list of the files that are stored on a given encrypted remote.
But, does git-annex really ever need to do such an enumeration?

Apparently not. `git annex unused --from remote` can now check for
unused data that is stored on a remote, and it does so based only on
location log data for the remote. This assumes that the location log is
kept accurately.

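The idea of finding unused data from the location log alone can be sketched as a set difference. This is only an illustration, not how git-annex is implemented: the dictionary layout of `location_log`, the key strings, and the function name `unused_on_remote` are all assumptions made up for the example.

```python
def unused_on_remote(location_log, remote, referenced_keys):
    """Return keys the location log places on `remote` that nothing references.

    location_log: hypothetical mapping of key -> set of remote names.
    referenced_keys: keys still used by files in the repository.
    No listing of the remote's contents is needed.
    """
    on_remote = {key for key, remotes in location_log.items() if remote in remotes}
    return on_remote - set(referenced_keys)


# Made-up data for illustration.
location_log = {
    "KEY-a": {"mys3", "usbdrive"},
    "KEY-b": {"mys3"},
    "KEY-c": {"usbdrive"},
}
unused = unused_on_remote(location_log, "mys3", referenced_keys=["KEY-a"])
```

The result is only as good as the location log, which is the accuracy assumption stated above.
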
What about `git annex fsck --from remote`? Such a command should be able to,
for each file in the repository, contact the encrypted remote to check
if it has the file. This can be done without enumeration, although it will
mean running gpg once per file fscked, to get the encrypted filename.

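A minimal sketch of such a check, assuming two hypothetical callables: `encrypted_name`, which maps a key to the name used on the remote (the step that needs the decrypted cipher), and `remote_has`, which asks the remote whether a single name exists. Neither is a git-annex API; they only stand in for the steps described above.

```python
from typing import Callable, Iterable, List


def fsck_from_remote(
    keys: Iterable[str],
    encrypted_name: Callable[[str], str],
    remote_has: Callable[[str], bool],
) -> List[str]:
    """Return the keys whose encrypted name is missing from the remote.

    The remote is probed once per key; its contents are never listed.
    """
    missing = []
    for key in keys:
        if not remote_has(encrypted_name(key)):
            missing.append(key)
    return missing


# Toy stand-ins: a fake name mapping and a set playing the part of the remote.
stored = {"enc-KEY-a", "enc-KEY-c"}
missing = fsck_from_remote(
    ["KEY-a", "KEY-b", "KEY-c"],
    encrypted_name=lambda key: "enc-" + key,
    remote_has=lambda name: name in stored,
)
```
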
So, the names of the files stored in the remote should be encrypted. But,
it needs to be a repeatable encryption, so they cannot just be gpg
encrypted; that would yield a new name each time. Instead, HMAC is used.
Any hash could be used with HMAC. SHA-1 is the default, but
[[other_hashes|/encryption]] can be chosen for new remotes.

It was suggested that it might not be wise to use the same cipher for both
gpg and HMAC. Being paranoid, it's best not to tie the security of one
to the security of the other. So, the encrypted cipher described above is
actually split in two; the first half is used for HMAC, and the second
half for gpg.

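The split and the repeatable naming can be sketched in Python. This is an illustration under assumptions, not git-annex's code: the cipher size, the example key string, and the function names are invented for the example, and `secrets.token_bytes` stands in for the cipher that GnuPG generates.

```python
import hashlib
import hmac
import secrets
from typing import Tuple

# Hypothetical size; the design only says the cipher is split in half.
CIPHER_BYTES = 64


def generate_cipher(size: int = CIPHER_BYTES) -> bytes:
    """Return random bytes standing in for the GnuPG-generated cipher."""
    return secrets.token_bytes(size)


def split_cipher(cipher: bytes) -> Tuple[bytes, bytes]:
    """First half keys the HMAC; second half is for gpg symmetric encryption."""
    half = len(cipher) // 2
    return cipher[:half], cipher[half:]


def remote_name(cipher: bytes, key: str, digest=hashlib.sha1) -> str:
    """Derive a repeatable name for `key` using HMAC with the first cipher half."""
    hmac_half, _gpg_half = split_cipher(cipher)
    return hmac.new(hmac_half, key.encode("utf-8"), digest).hexdigest()


cipher = generate_cipher()
name = remote_name(cipher, "example-key")
```

Calling `remote_name` again with the same cipher and key gives the same name, which is the property that plain gpg encryption of the name would not have. Passing `digest=hashlib.sha256` shows how a different hash could be swapped in.
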
----

Does the HMAC cipher need to be gpg encrypted? Imagine if it were
stored in plaintext in the git repository. Anyone who can access
the git repository already knows the actual filenames, and typically also
the content hashes of annexed content. Having access to the HMAC cipher
could perhaps be said to only let them verify data that they already
know.

While this seems a pretty persuasive argument, I'm not 100% convinced, and
anyway, most times that the HMAC cipher is needed, the gpg cipher is also
needed. Keeping the HMAC cipher encrypted does slow down two things:
dropping content from encrypted remotes, and checking if encrypted remotes
really have content. If it's later determined to be safe to not encrypt the
HMAC cipher, the current design allows changing that, even for existing
remotes.

## other use of the symmetric cipher

The symmetric cipher can be used to encrypt other content than the content
sent to the remote. In particular, it may make sense to encrypt whatever
access keys are used by the special remote with the cipher, and store that
in `remotes.log`. This way anyone whose gpg key has been given access to
the cipher can get access to whatever other credentials are needed to
use the special remote.

For example, the S3 special remote does this if configured with
`embedcreds=yes`.

## risks

A risk of this scheme is that, once the symmetric cipher has been
obtained, it allows full access to all the encrypted content. Indeed,
anyone owning a key that used to be granted access could already have
decrypted the cipher and stored a copy. While it is possible to
remove a key with `keyid-=`, it is designed for a
[[completely_different_purpose|/encryption]] and does not actually revoke
access.

If git-annex stores the decrypted symmetric cipher in memory, then there
is a risk that it could be intercepted from there by an attacker. Gpg
ameliorates these types of risks by using locked memory. For git-annex, note
that an attacker with local machine access can tell at least all the
filenames and metadata of files stored in the encrypted remote anyway,
and can access whatever content is stored locally.

This design does not address obfuscating the size of files by chunking
them. However, chunking was later added; see [[design/assistant/chunks]].