118 lines
5.4 KiB
Markdown
118 lines
5.4 KiB
Markdown
This was the design doc for [[/encryption]] and is preserved for
|
|
the curious. For an example of using git-annex with an encrypted S3 remote,
|
|
see [[tips/using_Amazon_S3]].
|
|
|
|
[[!toc]]
|
|
|
|
## encryption key management
|
|
|
|
[[!template id=note text="""
|
|
The basis of this scheme was originally developed by Lars Wirzenius et al
|
|
[for Obnam](http://liw.fi/obnam/encryption/).
|
|
"""]]
|
|
|
|
Data is encrypted by GnuPG, using a symmetric cipher. The cipher is
|
|
generated by GnuPG when the special remote is created. By default the
|
|
best entropy pool is used, hence the generation may take a while; One
|
|
can use `initremote` with the `--fast` option
|
|
to speed up things, but at the expense of using random numbers of a
|
|
lower quality. The generated cipher is then checked into your git
|
|
repository, encrypted using one or more OpenPGP public keys. This scheme
|
|
allows new OpenPGP private keys to be given access to content that has
|
|
already been stored in the remote.
|
|
|
|
Different encrypted remotes need to be able to each use different ciphers.
|
|
Allowing multiple ciphers to be used within a single remote would add a lot
|
|
of complexity, so is not supported.
|
|
Instead, if you want a new cipher, create a new S3 bucket, or whatever.
|
|
There does not seem to be much benefit to using the same cipher for
|
|
two different encrypted remotes.
|
|
|
|
So, the encrypted cipher is just stored with the rest of a remote's
|
|
configuration in `remote.log` (see [[internals]]). When `git
|
|
annex initremote` makes a remote, it generates a random symmetric
|
|
cipher, and encrypt it with the specified gpg key. To allow another gpg
|
|
public key access, update the encrypted cipher to be encrypted to both gpg
|
|
keys.
|
|
|
|
Note that there's a shared encryption mode where the cipher is not
|
|
encrypted. When this mode is used, any clone of the git repository
|
|
can decrypt files stored in its special remote.
|
|
|
|
## filename enumeration
|
|
|
|
If the names of files are encrypted or securely hashed, or whatever is
|
|
chosen, this makes it harder for git-annex (let alone untrusted third parties!)
|
|
to get a list of the files that are stored on a given enrypted remote.
|
|
But, does git-annex really ever need to do such an enumeration?
|
|
|
|
Apparently not. `git annex unused --from remote` can now check for
|
|
unused data that is stored on a remote, and it does so based only on
|
|
location log data for the remote. This assumes that the location log is
|
|
kept accurately.
|
|
|
|
What about `git annex fsck --from remote`? Such a command should be able to,
|
|
for each file in the repository, contact the encrypted remote to check
|
|
if it has the file. This can be done without enumeration, although it will
|
|
mean running gpg once per file fscked, to get the encrypted filename.
|
|
|
|
So, the files stored in the remote should be encrypted. But, it needs to
|
|
be a repeatable encryption, so they cannot just be gpg encrypted, that
|
|
would yeild a new name each time. Instead, HMAC is used. Any hash could
|
|
be used with HMAC. SHA-1 is the default, but [[other_hashes|/encryption]]
|
|
can be chosen for new remotes.
|
|
|
|
It was suggested that it might not be wise to use the same cipher for both
|
|
gpg and HMAC. Being paranoid, it's best not to tie the security of one
|
|
to the security of the other. So, the encrypted cipher described above is
|
|
actually split in two; the first half is used for HMAC, and the second
|
|
half for gpg.
|
|
|
|
----
|
|
|
|
Does the HMAC cipher need to be gpg encrypted? Imagine if it were
|
|
stored in plainext in the git repository. Anyone who can access
|
|
the git repository already knows the actual filenames, and typically also
|
|
the content hashes of annexed content. Having access to the HMAC cipher
|
|
could perhaps be said to only let them verify that data they already
|
|
know.
|
|
|
|
While this seems a pretty persuasive argument, I'm not 100% convinced, and
|
|
anyway, most times that the HMAC cipher is needed, the gpg cipher is also
|
|
needed. Keeping the HMAC cipher encrypted does slow down two things:
|
|
dropping content from encrypted remotes, and checking if encrypted remotes
|
|
really have content. If it's later determined to be safe to not encrypt the
|
|
HMAC cipher, the current design allows changing that, even for existing
|
|
remotes.
|
|
|
|
## other use of the symmetric cipher
|
|
|
|
The symmetric cipher can be used to encrypt other content than the content
|
|
sent to the remote. In particular, it may make sense to encrypt whatever
|
|
access keys are used by the special remote with the cipher, and store that
|
|
in `remote.log`. This way anyone whose gpg key has been given access to
|
|
the cipher can get access to whatever other credentials are needed to
|
|
use the special remote.
|
|
|
|
For example, the S3 special remote does this if configured with
|
|
embedcreds=yet.
|
|
|
|
## risks
|
|
|
|
A risk of this scheme is that, once the symmetric cipher has been
|
|
obtained, it allows full access to all the encrypted content. Indeed
|
|
anyone owning a key that used to be granted access could already have
|
|
decrypted the cipher and stored a copy. While it is in possible to
|
|
remove a key with `keyid-=`, it is designed for a
|
|
[[completely_different_purpose|/encryption]] and does not actually revoke
|
|
access.
|
|
|
|
If git-annex stores the decrypted symmetric cipher in memory, then there
|
|
is a risk that it could be intercepted from there by an attacker. Gpg
|
|
ameliorates these type of risks by using locked memory. For git-annex, note
|
|
that an attacker with local machine access can tell at least all the
|
|
filenames and metadata of files stored in the encrypted remote anyway,
|
|
and can access whatever content is stored locally.
|
|
|
|
This design does not address obfuscating the size of files by chunking
|
|
them. However, chunking was later added; see [[design/assistant/chunks]].
|