git-annex/doc/todo/sha1_collision_embedding_in_git-annex_keys.mdwn

Some git-annex backends allow embedding enough data in the names of keys
that it could be used for a SHA1 collision attack. So, a signed git commit
could point to a tree with such a key in it, and the blob for the key could
have two versions with the same SHA1.

> All issues below are [[done]] --[[Joey]]

Users who want to use git-annex with signed commits to mitigate git's own
SHA1 insecurities would like at least a way to disable the insecure
git-annex backends:

* WORM can contain fairly arbitrary data in a key name
* URL too (also, of course, URLs download arbitrary data from the web,
  so a signed git commit pointing at URL keys doesn't have any security
  even w/o SHA1 collisions)
* SHA1 and MD5 backends are insecure because there can be colliding
  versions of the data they point to.

There could be a config setting, which would prevent git-annex from using
keys with such insecure backends. A user who checks git commit signatures
could enable the config setting when they initially clone their repository.
This should prevent any file contents using insecure backends from being
downloaded into the repository. (Even git-annex-shell recvkey would
refuse data using such a key, since it would fail parsing the key.)
The user would thus know that any file contents in their repository match
the files in signed git commits.

Enabling the config setting in a repository that already contains
file contents would be a mistake, because it might contain insecure keys.
And since git-annex would skip over such files, `git annex fsck` cannot
warn about such a mistake.

Perhaps, then, the config setting should be turned on by `git annex init`?
Or, we can document this gotcha.

> I've done some groundwork for this, but making git-annex not accept
> insecure keys into the repo at all requires changing file2key,
> which is a pure function that's used in eg, instances for serialization.
> 
> So, how to make it vary depending on git config? Can't. Alternative
> would be to add lots of checks everywhere a key is read from disk
> or network, which feels like it would be a hard security boundary to
> manage.
> 
> It doesn't really matter if content under an insecure key is in the
> repo, as long as there's not a signed commit referencing such a key.
> So, we could say, this is up to the user constucting a signed commit, to not
> put such keys in the commit.
> 
> Or, we could use the pre-commit hook, and when 
> the config setting disallows insecure keys, make it reject commits
> that contain them. But, if a past commit added a file using an insecure
> key, and the current commit does not touch it, should it be rejected?
> Rejecting it would then require a somewhat expensive look at the tree
> being committed.
> 
> The user might be merging a branch from someone else; there seems no
> git hook that can sanity check a fast-forward merge.
> 
> Perhaps leave it up to the person making signed commits to get it 
> right, and make git annex fsck warn about such keys? That seems
> reasonable. --[[Joey]]

> > Rather than preventing SHA1/URL/WORM Keys, could put checks in
> > Annex.Content.moveAnnex to prevent SHA1/URL/WORM objects reaching the
> > repository. That would make moveAnnex a security boundary, which is is
> > not currently. Would need to audid to check if anything else populates
> > .git/annex/objects.
> >
> > Annex.Transfer.runTransfer could also check for disallowed objects,
> > not as a security boundary, but to prevent accidental expensive
> > transfers that would fail at the moveAnnex stage.

> > As to how to enable this, it may make sense to use git-annex-config
> > but only read the value from the git-annex branch when initializing the
> > repository, and cache it in git-config.
> > 
> > This way, a repository can be created and configured not to allow
> > SHA1/URL/WORM, and all clones will inherit this configuration. 
> >
> > Users can also set it in git-config on a per repository basis.
> > 
> > If the git-annex-config setting is changed, existing clone's won't
> > change their behavior, although new ones will.  That's a mixed
> > blessing; it makes it harder to switch an existing repo to disallowing
> > SHA1/URL/WORM, but an accidental/malicious re-enabling won't affect
> > clones made while it was disabled. 
> > > This is done now.
> > 
> > Could a repository be configured to either always disallow
> > SHA1/URL/WORM, or always allow them, and then not let that be changed?
> > Maybe -- Look through all the history of the git-annex branch from the
> > earliest commit forward. The first value stored in 
> > git-annex/disableinsecurehashes (eg 0 or 1) is the value to use;
> > any later changes are ignored. 
> > That would be a little slow, but only needs to be done at init time.
> > It might be possible to fool this though. Create a new empty branch,
> > with an old date, make a commit enabling insecure hashes, and
> > merge it into git-annex branch HEAD. It now looks as if insecure hashes
> > were disabled earliest.

> > > Well, annex.securehashesonly is implemented now. It currently needs to be
> > > set in each clone that cares about it. --[[Joey]]

----

A few other potential problems:

* A symlink target like .git/annex/objects/XX/YY/SHA256--foo
  might be able to be manipulated to add collision data in the path.
  For example, .git/annex/objects/collisiondata/../XX/YY/SHA256--foo

  I think this is not a valid attack, because at least on linux,
  such a symlink won't be followed, unless the
  .git/annex/objects/collisiondata directory exists.

* `*E` backends could embed sha1 collision data in a long filename
  extension in a key.

  The recent SHA1 common-prefix attack could be used to exploit this;
  the result would be two keys that have the same SHA1.

  This can be fixed by limiting the length
  of an extension allowed in such a key to the longest such extension
  git-annex has ever supported (probably < 20 bytes or so), which would
  be less than the size of the data needed for current SHA1 collision
  attacks. Now done; git-annex refuses to use keys with super
  long extensions.

* It might be possible to embed colliding data in a specially constructed
  key name with an extra field in it, eg "SHA256-cXXXXXXXXXXXXXXX-...".
  Need to review the code and see if such extra fields are allowed.  

  Update: All fields are numeric, but could contain arbitrary data
  after the number. Could have been used in a common-prefix attack.
  This has been fixed; git-annex refuses to parse
  such fields, so it won't work with files that try to exploit this.

* A symlink target like .git/annex/objects/XX/YY/SHA256--foo
  might be able to be manipulated to add collision data in the path.
  For example, .git/annex/objects/collisiondata/../XX/YY/SHA256--foo

  I think this is not a valid attack, because at least on linux,
  such a symlink won't be followed, unless the
  .git/annex/objects/collisiondata directory exists.
Tighten key parser to not accept keys containing a non-numeric fields, which could be used to embed data useful for a SHA1 attack against git. Also todo about why this is important, and with some further hardening to add. This commit was sponsored by Ignacio on Patreon. 2017-02-24 04:17:25 +00:00			`Some git-annex backends allow embedding enough data in the names of keys`
			`that it could be used for a SHA1 collision attack. So, a signed git commit`
			`could point to a tree with such a key in it, and the blob for the key could`
			`have two versions with the same SHA1.`

inheritable annex.securehashesonly * init: When annex.securehashesonly has been set with git-annex config, copy that value to the annex.securehashesonly git config. * config --set: As well as setting value in git-annex branch, set local gitconfig. This is needed especially for annex.securehashesonly, which is read only from local gitconfig and not the git-annex branch. doc/todo/sha1_collision_embedding_in_git-annex_keys.mdwn has the rationalle for doing it this way. There's no perfect solution; this seems to be the least-bad one. This commit was supported by the NSF-funded DataLad project. 2017-02-27 20:08:16 +00:00			`> All issues below are [[done]] --[[Joey]]`

Tighten key parser to not accept keys containing a non-numeric fields, which could be used to embed data useful for a SHA1 attack against git. Also todo about why this is important, and with some further hardening to add. This commit was sponsored by Ignacio on Patreon. 2017-02-24 04:17:25 +00:00			`Users who want to use git-annex with signed commits to mitigate git's own`
			`SHA1 insecurities would like at least a way to disable the insecure`
			`git-annex backends:`

			`* WORM can contain fairly arbitrary data in a key name`
			`* URL too (also, of course, URLs download arbitrary data from the web,`
			`so a signed git commit pointing at URL keys doesn't have any security`
			`even w/o SHA1 collisions)`
			`* SHA1 and MD5 backends are insecure because there can be colliding`
			`versions of the data they point to.`

update 2017-02-24 16:28:10 +00:00			`There could be a config setting, which would prevent git-annex from using`
			`keys with such insecure backends. A user who checks git commit signatures`
			`could enable the config setting when they initially clone their repository.`
			`This should prevent any file contents using insecure backends from being`
			`downloaded into the repository. (Even git-annex-shell recvkey would`
			`refuse data using such a key, since it would fail parsing the key.)`
			`The user would thus know that any file contents in their repository match`
			`the files in signed git commits.`
Tighten key parser to not accept keys containing a non-numeric fields, which could be used to embed data useful for a SHA1 attack against git. Also todo about why this is important, and with some further hardening to add. This commit was sponsored by Ignacio on Patreon. 2017-02-24 04:17:25 +00:00
update 2017-02-24 16:28:10 +00:00			`Enabling the config setting in a repository that already contains`
			`file contents would be a mistake, because it might contain insecure keys.`
			And since git-annex would skip over such files, `git annex fsck` cannot
			`warn about such a mistake.`

			Perhaps, then, the config setting should be turned on by `git annex init`?
			`Or, we can document this gotcha.`

update 2017-02-24 21:57:21 +00:00			`> I've done some groundwork for this, but making git-annex not accept`
			`> insecure keys into the repo at all requires changing file2key,`
further thoughts 2017-02-25 16:55:38 +00:00			`> which is a pure function that's used in eg, instances for serialization.`
update 2017-02-24 21:57:21 +00:00			`>`
			`> So, how to make it vary depending on git config? Can't. Alternative`
			`> would be to add lots of checks everywhere a key is read from disk`
			`> or network, which feels like it would be a hard security boundary to`
			`> manage.`
			`>`
			`> It doesn't really matter if content under an insecure key is in the`
			`> repo, as long as there's not a signed commit referencing such a key.`
			`> So, we could say, this is up to the user constucting a signed commit, to not`
			`> put such keys in the commit.`
			`>`
			`> Or, we could use the pre-commit hook, and when`
			`> the config setting disallows insecure keys, make it reject commits`
			`> that contain them. But, if a past commit added a file using an insecure`
			`> key, and the current commit does not touch it, should it be rejected?`
			`> Rejecting it would then require a somewhat expensive look at the tree`
			`> being committed.`
			`>`
			`> The user might be merging a branch from someone else; there seems no`
			`> git hook that can sanity check a fast-forward merge.`
			`>`
			`> Perhaps leave it up to the person making signed commits to get it`
			`> right, and make git annex fsck warn about such keys? That seems`
			`> reasonable. --[[Joey]]`

more thoughts 2017-02-25 18:49:44 +00:00			`> > Rather than preventing SHA1/URL/WORM Keys, could put checks in`
			`> > Annex.Content.moveAnnex to prevent SHA1/URL/WORM objects reaching the`
further thoughts 2017-02-25 16:55:38 +00:00			`> > repository. That would make moveAnnex a security boundary, which is is`
			`> > not currently. Would need to audid to check if anything else populates`
			`> > .git/annex/objects.`
			`> >`
			`> > Annex.Transfer.runTransfer could also check for disallowed objects,`
			`> > not as a security boundary, but to prevent accidental expensive`
			`> > transfers that would fail at the moveAnnex stage.`

			`> > As to how to enable this, it may make sense to use git-annex-config`
			`> > but only read the value from the git-annex branch when initializing the`
			`> > repository, and cache it in git-config.`
			`> >`
more thoughts 2017-02-25 18:49:44 +00:00			`> > This way, a repository can be created and configured not to allow`
			`> > SHA1/URL/WORM, and all clones will inherit this configuration.`
further thoughts 2017-02-25 16:55:38 +00:00			`> >`
			`> > Users can also set it in git-config on a per repository basis.`
			`> >`
			`> > If the git-annex-config setting is changed, existing clone's won't`
			`> > change their behavior, although new ones will. That's a mixed`
			`> > blessing; it makes it harder to switch an existing repo to disallowing`
more thoughts 2017-02-25 18:49:44 +00:00			`> > SHA1/URL/WORM, but an accidental/malicious re-enabling won't affect`
inheritable annex.securehashesonly * init: When annex.securehashesonly has been set with git-annex config, copy that value to the annex.securehashesonly git config. * config --set: As well as setting value in git-annex branch, set local gitconfig. This is needed especially for annex.securehashesonly, which is read only from local gitconfig and not the git-annex branch. doc/todo/sha1_collision_embedding_in_git-annex_keys.mdwn has the rationalle for doing it this way. There's no perfect solution; this seems to be the least-bad one. This commit was supported by the NSF-funded DataLad project. 2017-02-27 20:08:16 +00:00			`> > clones made while it was disabled.`
			`> > > This is done now.`
more thoughts 2017-02-25 18:49:44 +00:00			`> >`
			`> > Could a repository be configured to either always disallow`
			`> > SHA1/URL/WORM, or always allow them, and then not let that be changed?`
move thoughts 2017-02-25 19:00:22 +00:00			`> > Maybe -- Look through all the history of the git-annex branch from the`
more thoughts 2017-02-25 18:49:44 +00:00			`> > earliest commit forward. The first value stored in`
			`> > git-annex/disableinsecurehashes (eg 0 or 1) is the value to use;`
			`> > any later changes are ignored.`
			`> > That would be a little slow, but only needs to be done at init time.`
move thoughts 2017-02-25 19:00:22 +00:00			`> > It might be possible to fool this though. Create a new empty branch,`
			`> > with an old date, make a commit enabling insecure hashes, and`
			`> > merge it into git-annex branch HEAD. It now looks as if insecure hashes`
			`> > were disabled earliest.`
further thoughts 2017-02-25 16:55:38 +00:00
update 2017-02-27 19:29:23 +00:00			`> > > Well, annex.securehashesonly is implemented now. It currently needs to be`
			`> > > set in each clone that cares about it. --[[Joey]]`

update 2017-02-24 16:28:10 +00:00			`----`
Tighten key parser to not accept keys containing a non-numeric fields, which could be used to embed data useful for a SHA1 attack against git. Also todo about why this is important, and with some further hardening to add. This commit was sponsored by Ignacio on Patreon. 2017-02-24 04:17:25 +00:00
			`A few other potential problems:`

update 2017-02-24 16:28:10 +00:00			`* A symlink target like .git/annex/objects/XX/YY/SHA256--foo`
			`might be able to be manipulated to add collision data in the path.`
			`For example, .git/annex/objects/collisiondata/../XX/YY/SHA256--foo`

			`I think this is not a valid attack, because at least on linux,`
			`such a symlink won't be followed, unless the`
			`.git/annex/objects/collisiondata directory exists.`

Tighten key parser to not accept keys containing a non-numeric fields, which could be used to embed data useful for a SHA1 attack against git. Also todo about why this is important, and with some further hardening to add. This commit was sponsored by Ignacio on Patreon. 2017-02-24 04:17:25 +00:00			* `*E` backends could embed sha1 collision data in a long filename
update 2017-02-24 05:21:54 +00:00			`extension in a key.`

SHA1 collisions in key names was more exploitable than I thought Yesterday's SHA1 collision attack could be used to generate eg: SHA256-sfoo--whatever.good SHA256-sfoo--whatever.bad Such that they collide. A repository with the good one could have the bad one swapped in and signed commits would still verify. I've already mitigated this. 2017-02-24 23:54:36 +00:00			`The recent SHA1 common-prefix attack could be used to exploit this;`
			`the result would be two keys that have the same SHA1.`

			`This can be fixed by limiting the length`
update 2017-02-24 04:21:58 +00:00			`of an extension allowed in such a key to the longest such extension`
			`git-annex has ever supported (probably < 20 bytes or so), which would`
update 2017-02-24 05:21:54 +00:00			`be less than the size of the data needed for current SHA1 collision`
update 2017-02-24 16:28:10 +00:00			`attacks. Now done; git-annex refuses to use keys with super`
make file2key reject E* backend keys with a long extension I am not happy that I had to put backend-specific code in file2key. But it would be very difficult to avoid this layering violation. Most of the time, when parsing a Key from a symlink target, git-annex never looks up its Backend at all, so adding this check to a method of the Backend object would not work. The Key could be made to contain the appropriate Backend, but since Backend is parameterized on an "a" that is fixed to the Annex monad later, that would need Key to change to "Key a". The only way to clean this up that I can see would be to have the Key contain a LowlevelBackend, and put the validation in LowlevelBackend. Perhaps later, but that would be an extensive change, so let's not do it in this commit which may want to cherry-pick to backports. This commit was sponsored by Ethan Aubin. 2017-02-24 15:17:07 +00:00			`long extensions.`
update 2017-02-24 05:21:54 +00:00
Tighten key parser to not accept keys containing a non-numeric fields, which could be used to embed data useful for a SHA1 attack against git. Also todo about why this is important, and with some further hardening to add. This commit was sponsored by Ignacio on Patreon. 2017-02-24 04:17:25 +00:00			`* It might be possible to embed colliding data in a specially constructed`
			`key name with an extra field in it, eg "SHA256-cXXXXXXXXXXXXXXX-...".`
			`Need to review the code and see if such extra fields are allowed.`
updates 2017-02-24 04:28:15 +00:00
Tighten key parser to not accept keys containing a non-numeric fields, which could be used to embed data useful for a SHA1 attack against git. Also todo about why this is important, and with some further hardening to add. This commit was sponsored by Ignacio on Patreon. 2017-02-24 04:17:25 +00:00			`Update: All fields are numeric, but could contain arbitrary data`
SHA1 collisions in key names was more exploitable than I thought Yesterday's SHA1 collision attack could be used to generate eg: SHA256-sfoo--whatever.good SHA256-sfoo--whatever.bad Such that they collide. A repository with the good one could have the bad one swapped in and signed commits would still verify. I've already mitigated this. 2017-02-24 23:54:36 +00:00			`after the number. Could have been used in a common-prefix attack.`
update 2017-02-24 16:28:10 +00:00			`This has been fixed; git-annex refuses to parse`
updates 2017-02-24 04:28:15 +00:00			`such fields, so it won't work with files that try to exploit this.`
update 2017-02-24 16:28:10 +00:00
			`* A symlink target like .git/annex/objects/XX/YY/SHA256--foo`
			`might be able to be manipulated to add collision data in the path.`
			`For example, .git/annex/objects/collisiondata/../XX/YY/SHA256--foo`

			`I think this is not a valid attack, because at least on linux,`
			`such a symlink won't be followed, unless the`
			`.git/annex/objects/collisiondata directory exists.`