It was relying on segmentPaths to work correctly, so when it didn't,
sometimes the file that did not exist got matched up with a non-null
list of results. Fixed by always checking if each parameter exists.
There are two reason segmentPaths might not work correctly.
For one, it assumes that when the original list of paths
has more than 100 paths, it's not worth paying the CPU cost to
preserve input orders.
And then, it fails when a directory such as "." or ".." or
/path/to/repo is in the input list, and the list of found paths
does not start with that same thing. It should probably not be using
dirContains, but something else.
But, it's not clear how to handle this fully. Consider
when [".", "subdir"] has been expanded by git ls-files to
["subdir/1", "subdir/2"]
-- Both of the inputs contained those results, so there's
no one right answer for segmentPaths. All these would be equally valid:
[["subdir/1", "subdir/2"], []]
[[], ["subdir/1", "subdir/2"]]
[["subdir/1"], [""subdir/2"]]
So I've not tried to improve segmentPaths.
* init: When annex.securehashesonly has been set with git-annex config,
copy that value to the annex.securehashesonly git config.
* config --set: As well as setting value in git-annex branch,
set local gitconfig. This is needed especially for
annex.securehashesonly, which is read only from local gitconfig and not
the git-annex branch.
doc/todo/sha1_collision_embedding_in_git-annex_keys.mdwn has the
rationalle for doing it this way. There's no perfect solution; this
seems to be the least-bad one.
This commit was supported by the NSF-funded DataLad project.
This avoids sending all the data to a remote, only to have it reject it
because it has annex.securehashesonly set. It assumes that local and
remote will have the same annex.securehashesonly setting in most cases.
If a remote does not have that set, and local does, the remote won't get
some content it would otherwise accept.
Also avoids downloading data that will not be added to the local object
store due to annex.securehashesonly.
Note that, while encrypted special remotes use a GPGHMAC key variety,
which is not collisiton resistent, Transfers are not used for such
keys, so this check is avoided. Which is what we want, so encrypted
special remotes still work.
This commit was sponsored by Ewen McNeill.
Added --securehash option to match files using a secure hash function, and
corresponding securehash preferred content expression.
This commit was sponsored by Ethan Aubin.
Cryptographically secure hashes can be forced to be used in a repository,
by setting annex.securehashesonly. This does not prevent the git repository
from containing files with insecure hashes, but it does prevent the content
of such files from being pulled into .git/annex/objects from another
repository.
We want to make sure that at no point does git-annex accept content into
.git/annex/objects that is hashed with an insecure key. Here's how it
was done:
* .git/annex/objects/xx/yy/KEY/ is kept frozen, so nothing can be
written to it normally
* So every place that writes content must call, thawContent or modifyContent.
We can audit for these, and be sure we've considered all cases.
* The main functions are moveAnnex, and linkToAnnex; these were made to
check annex.securehashesonly, and are the main security boundary
for annex.securehashesonly.
* Most other calls to modifyContent deal with other files in the KEY
directory (inode cache etc). The other ones that mess with the content
are:
- Annex.Direct.toDirectGen, in which content already in the
annex directory is moved to the direct mode file, so not relevant.
- fix and lock, which don't add new content
- Command.ReKey.linkKey, which manually unlocks it to make a
copy.
* All other calls to thawContent appear safe.
Made moveAnnex return a Bool, so checked all callsites and made them
deal with a failure in appropriate ways.
linkToAnnex simply returns LinkAnnexFailed; all callsites already deal
with it failing in appropriate ways.
This commit was sponsored by Riku Voipio.
Note that GPGHMAC keys are not cryptographically secure, because their
content has no relation to the name of the key. So, things that use this
function to avoid sending keys to a remote will need to special case in
support for those keys. If GPGHMAC keys were accepted as
cryptographically secure, symlinks using them could be committed to a
git repo, and their content would be accepted into the repo, with no
guarantee that two repos got the same content, which is what we're aiming
to prevent.