git-annex/doc/design/external_backend_protocol.mdwn

Communication between git-annex and a program implementing an external
[[backend|backends]] uses this protocol.

[[!toc ]]

## starting the program

The external backend program has a name like `git-annex-backend-XFOO`.
When git-annex is configured to use a backend whose name starts with "X",
or encounters a key in a repository whose backend name starts with "X", it
looks for the corresponding external backend program in PATH.

The program is started by git-annex when it needs to use it, and may be
left running for a long period of time. Note that git-annex may choose to
run multiple instances of the program.

## protocol overview

Communication is via stdin and stdout. While stderr is connected to the
console and so visible to the user, the program should avoid using it
except in the most exceptional circumstances.

The protocol is line-based. git-annex sends a request, and the program
responds with a reply.
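
The request/reply loop can be sketched roughly as follows. This is only an illustration, not how any existing backend program is written: `serve` and the `handlers` table are hypothetical names, and answering an unrecognised request with `ERROR` is an assumption, since this page does not define a specific reply for that case.

```python
def serve(handlers, inp, out):
    """Read requests line by line and write the replies of the matching handler.

    handlers maps a command name to a function that takes the rest of the
    line and returns a list of reply lines (hypothetical structure).
    """
    for line in inp:
        command, _, rest = line.rstrip("\n").partition(" ")
        handler = handlers.get(command)
        if handler is None:
            # Assumption: treat an unknown request as a fatal protocol error.
            out.write("ERROR unexpected request " + command + "\n")
            out.flush()
            return
        for reply in handler(rest):
            out.write(reply + "\n")
            # Flush after every line so the reply is not left sitting in a
            # buffer while git-annex waits for it.
            out.flush()
```

A real program would presumably call `serve(handlers, sys.stdin, sys.stdout)`; taking the streams as parameters just makes the loop easy to exercise in isolation.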

Each protocol line starts with a command, which is followed by the
command's parameters (a fixed number per command), each separated by a
single space. The last parameter may contain spaces. Parameters may be
empty, but the separating spaces are still required in that case.
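
One way to split a line according to these rules is sketched below. The `PARAM_COUNTS` table and `parse_line` are hypothetical; the counts are read off the messages listed on this page that take a fixed set of parameters.

```python
# Number of parameters for each message (hypothetical table built from
# the messages described on this page).
PARAM_COUNTS = {
    "GETVERSION": 0,
    "CANVERIFY": 0,
    "ISSTABLE": 0,
    "ISCRYPTOGRAPHICALLYSECURE": 0,
    "GENKEY": 1,
    "VERIFYKEYCONTENT": 2,
    "DEBUG": 1,
    "ERROR": 1,
}

def parse_line(line):
    """Split one protocol line into (command, [parameters])."""
    command, sep, rest = line.rstrip("\n").partition(" ")
    count = PARAM_COUNTS.get(command)
    if count is None:
        raise ValueError("unknown command: " + command)
    if count == 0:
        return command, []
    if not sep:
        raise ValueError("missing separating space after " + command)
    # Split on single spaces at most count-1 times: empty parameters are
    # preserved, and the last parameter keeps any spaces it contains.
    params = rest.split(" ", count - 1)
    if len(params) != count:
        raise ValueError("wrong number of parameters for " + command)
    return command, params
```

Splitting with an explicit `" "` separator, rather than on arbitrary whitespace, is what keeps empty parameters and file names containing spaces intact.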

## example session

git-annex always starts by sending a message asking the program what protocol
version it uses.

	GETVERSION

The program responds.

	VERSION 1

git-annex will next query the program about the properties of the keys it
uses (`CANVERIFY`, `ISSTABLE`, `ISCRYPTOGRAPHICALLYSECURE`), and the program will
respond to each query.

Then git-annex may ask the program to generate a key.

	GENKEY somefile

The program will respond with the key it generated, but if it needs to do
an expensive operation, such as hashing the file, it can first send
progress messages, indicating the position in the file it has processed.

	PROGRESS 1024
	PROGRESS 2048
	GENKEY-SUCCESS XFOO-s2048--dbd009

git-annex can also ask the program to verify if the content of a file
matches a key.

	VERIFYKEYCONTENT XFOO-s2048--dbd009 somefile

Again the program can send progress messages as it works, finishing
with the result of the verification.

	PROGRESS 1024
	PROGRESS 2048
	VERIFYKEYCONTENT-SUCCESS

## startup messages and replies

These messages are sent to the program soon after starting it, and it should
reply with one of the listed replies.

* `GETVERSION`
  Always the first message sent.
  Currently the only version of this protocol is version 1.
  * `VERSION 1`
* `CANVERIFY`
  Asks if the program can verify that the content of a file matches a key it generated.
  The verification does not need to be cryptographically secure, but should
  catch data corruption.
  * `CANVERIFY-YES`
  * `CANVERIFY-NO`
* `ISSTABLE`
  Asks the program if a key it has generated will always have the same
  content. The answer to this is almost always yes; URL keys are an example
  of a type of key that may have different content at different times.
  * `ISSTABLE-YES`
  * `ISSTABLE-NO`
* `ISCRYPTOGRAPHICALLYSECURE`
  Asks the program if keys it generates are verified using a cryptographically
  secure hash. Note that sha1 is *not* a cryptographically secure hash any
  longer. A program can change its answer to this question as the state of the
  art advances, and should aim to stay ahead of the state of the art by a
  reasonable amount of time.
  * `ISCRYPTOGRAPHICALLYSECURE-YES`
  * `ISCRYPTOGRAPHICALLYSECURE-NO`
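
The startup exchange above could be answered from a small table of properties. In this sketch `PROPERTIES` and `startup_reply` are hypothetical names, and the values describe an imaginary XFOO backend, not any real one.

```python
# Properties of an imaginary XFOO backend (assumed values for illustration).
PROPERTIES = {
    "CANVERIFY": True,
    "ISSTABLE": True,
    "ISCRYPTOGRAPHICALLYSECURE": False,
}

def startup_reply(message):
    """Return the reply line for a startup message, or None if it is not one."""
    if message == "GETVERSION":
        return "VERSION 1"
    if message in PROPERTIES:
        return message + ("-YES" if PROPERTIES[message] else "-NO")
    return None
```

Keeping the answers in one table makes it easy to change a single answer later, for instance `ISCRYPTOGRAPHICALLYSECURE` as the state of the art advances.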

## main messages and replies

This is where work happens.

* `GENKEY ContentFile`
  The program should examine the ContentFile and from it generate a
  key. While it is doing this, it can send any number of `PROGRESS`
  messages indicating the position in the file that it's gotten to.
  * `GENKEY-SUCCESS Key`
  * `GENKEY-FAILURE ErrorMsg`
* `VERIFYKEYCONTENT Key ContentFile`
  The program should examine the ContentFile and verify that it has the
  content it would expect for the Key. While it is doing this, it can
  send any number of `PROGRESS` messages indicating the position in the
  file that it's gotten to. (If the program earlier sent `CANVERIFY-NO`,
  it will not be asked to do this.)
  * `VERIFYKEYCONTENT-SUCCESS`
  * `VERIFYKEYCONTENT-FAILURE`
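
As a rough sketch, the two requests above might be handled like this. Everything here is an assumption made for illustration: the backend name `XFOO`, the use of SHA-256 as the key name, the chunk size, and the `send` callback that writes one protocol line.

```python
import hashlib

CHUNK = 65536  # bytes read between PROGRESS messages (arbitrary choice)

def hash_file(path, send):
    """Hash the file, reporting the position reached after each chunk."""
    digest = hashlib.sha256()
    position = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            digest.update(chunk)
            position += len(chunk)
            send("PROGRESS %d" % position)
    return position, digest.hexdigest()

def genkey(path, send):
    try:
        size, name = hash_file(path, send)
    except OSError as e:
        # Keep the error message on a single protocol line.
        send("GENKEY-FAILURE " + str(e).replace("\n", " "))
        return
    send("GENKEY-SUCCESS XFOO-s%d--%s" % (size, name))

def verifykeycontent(key, path, send):
    # The key name is the part after the first "--".
    expected = key.partition("--")[2]
    try:
        _, name = hash_file(path, send)
    except OSError:
        send("VERIFYKEYCONTENT-FAILURE")
        return
    if name == expected:
        send("VERIFYKEYCONTENT-SUCCESS")
    else:
        send("VERIFYKEYCONTENT-FAILURE")
```

The size is included when generating the key but not compared when verifying, on the understanding that git-annex checks the size itself.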

## general messages

These messages can be sent at any time by either git-annex or the program.

* `DEBUG message`
  Tells git-annex to display the message if --debug is enabled.
  (git-annex does not send a reply to this message.)
* `ERROR ErrorMsg`
  Generic error. Can be sent at any time if things get too messed up to
  continue. When possible, use a more specific reply.
  The program should exit after sending this, as git-annex will not talk to
  it any further. If the program receives an `ERROR` from git-annex, it can
  exit with its own `ERROR`.
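
From the program's side, these two messages might be wrapped in small helpers like the ones below. The names `send`, `debug` and `fatal` are hypothetical, and the exit status of 1 is an assumption, since this page does not specify one.

```python
import sys

def send(line, out=None):
    """Write one protocol line and flush it immediately."""
    out = out if out is not None else sys.stdout
    # A message must stay on one line, so embedded newlines are replaced.
    out.write(line.replace("\n", " ") + "\n")
    out.flush()

def debug(message, out=None):
    """Ask git-annex to display the message when --debug is enabled."""
    send("DEBUG " + message, out)

def fatal(message, out=None):
    """Report a generic error, then exit, since git-annex will not reply."""
    send("ERROR " + message, out)
    sys.exit(1)  # assumed exit status
```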

## considerations for generating keys

See [[internals/key_format]] for how to format a key and details about the
parts of a key.

The backend name should match the name of the program, eg if the program
is git-annex-backend-XFOO, it should generate a key starting with "XFOO-".

The backend name (and program name) has to be all uppercase, and should be
reasonably short (max 10 bytes or so), and should be entirely ascii
alphanumerics. Eg, use similar names to other [[backends]]. It must not end
with "E" (see next paragraph for why).

git-annex will automatically also support an "E" variant of the backend,
which adds a filename extension to the end of the key. It does this
entirely transparently to the program, so while the repository may be using
XFOOE keys, the program will always generate and verify XFOO keys.
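
To illustrate the naming rule, the sketch below maps the backend part of a key to the program name that would presumably be looked up in PATH. It is a guess at the observable behaviour based on the description above, not git-annex's actual implementation; `program_name` and `find_program` are hypothetical.

```python
import shutil

def program_name(key_backend):
    """Map a key's backend name to the assumed external program name."""
    # Backend names must not end with "E", so a trailing "E" can only be
    # the extension-preserving variant added by git-annex.
    if key_backend.endswith("E"):
        key_backend = key_backend[:-1]
    return "git-annex-backend-" + key_backend

def find_program(key):
    """Return the path of the program for a key, or None if not in PATH."""
    backend = key.split("-", 1)[0]
    return shutil.which(program_name(backend))
```

Under this reading, both `XFOO-s2048--dbd009` and `XFOOE-s2048--dbd009.txt` lead to `git-annex-backend-XFOO`.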

The key name is typically some kind of hash, but is not limited to a hash.
The length of it needs to be similar to the lengths of other git-annex
keys. Too long a key name will make it annoying to work with repositories
using them, or even cause problems due to filename length limits. 128 bytes
maximum, but shorter is better. It should consist entirely of ascii
characters in the set `A-Za-z0-9`; `-` is also allowed, but other
punctuation is not.
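
The naming rules for backends and key names can be checked mechanically. The following is a sketch with hypothetical function names; it treats the "max 10 bytes or so" guideline as a fixed limit for simplicity, and assumes the name starts with "X" as external backend names do.

```python
import re

BACKEND_NAME = re.compile(r"X[A-Z0-9]*")   # uppercase ascii alphanumerics
KEY_NAME = re.compile(r"[A-Za-z0-9-]+")    # alphanumerics and "-"

MAX_BACKEND_LEN = 10   # soft guideline, used as a hard limit in this sketch
MAX_KEY_NAME_LEN = 128

def valid_backend_name(name):
    return (BACKEND_NAME.fullmatch(name) is not None
            and not name.endswith("E")
            and len(name) <= MAX_BACKEND_LEN)

def valid_key_name(name):
    return (KEY_NAME.fullmatch(name) is not None
            and len(name) <= MAX_KEY_NAME_LEN)
```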

It's important that, if the program responds with
`ISCRYPTOGRAPHICALLYSECURE-YES`, the key name contains only a hash, and not
other data from some other source. That other data could be used to try to
mount a sha1 collision attack against git, by embedding colliding material
in the key name, where users are unlikely to notice it. While git has
several things that make sha1 collision attacks difficult, we don't want
this chink in the armor.

It's almost always a good idea to include the size field when generating a
key. The size does not need to be checked when verifying content, as
git-annex handles that for you. The only time it would make sense to omit
the size field is if the content of a key is not stable and might have
different sizes (like some URL keys do).
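
A key with an optional size field could be assembled as in this sketch. `format_key` is a hypothetical helper that only covers the size field; [[internals/key_format]] remains the reference for the full format.

```python
def format_key(backend, name, size=None):
    """Build a key from a backend name, a key name and an optional size."""
    fields = backend
    if size is not None:
        fields += "-s%d" % size
    # The key name follows a double dash.
    return fields + "--" + name
```

For example, `format_key("XFOO", "dbd009", size=2048)` gives `XFOO-s2048--dbd009`, the key used in the example session.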

There's generally no reason to include the mtime field, and it should
never be verified when verifying content.

## program names must be unique

It's important that two different programs don't use the same name, because
that would result in bad behavior if the wrong program were used with a
repository with keys generated by the other program.

To avoid picking the same name, there is a list of known external backend
programs in [[backends]].

## signals

The program should not block SIGINT or SIGTERM. Doing so may cause
git-annex to hang waiting on it to exit. Of course it's ok to catch those
signals and do some necessary cleanup before exiting.
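
Catching the signals for cleanup without blocking them might look like the sketch below. The helper names are hypothetical, and exiting with status 128 plus the signal number follows a common shell convention rather than anything this protocol requires.

```python
import signal
import sys

def make_handler(cleanup):
    """Build a signal handler that cleans up and then exits promptly."""
    def handler(signum, frame):
        cleanup()
        # Exit rather than return, so git-annex is not left waiting.
        sys.exit(128 + signum)
    return handler

def install(cleanup):
    """Install the handler for both signals git-annex may send."""
    handler = make_handler(cleanup)
    signal.signal(signal.SIGINT, handler)
    signal.signal(signal.SIGTERM, handler)
```

The cleanup function should be quick, for example removing a temporary file, so that the program still exits soon after the signal arrives.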