QuickCheck 2.10 found a counterexample eg "\929184" broke the property.
As far as I can tell, Git.Filename is matching how git handles encoding
of strange high unicode characters in filenames for display. Git does
not display high unicode characters, and instead displays the C-style
escaped form of each byte. This is ambiguous, but since git is not
unicode aware, it doesn't need to roundtrip parse it.
So, making Git.FileName's roundtrip test only chars < 256 seems fine.
Utility.Format.format uses encode_c, in order to mimic git, so that's
ok.
Utility.Format.gen uses decode_c, but only so that stuff like "\n"
in the format string is handled. If the format string contains C-style
octal escapes, they will be converted to ascii characters, and not
combined into unicode characters, but that should not be a problem.
If the user wants unicode characters, they can include them in the
format string, without escaping them.
Finally, decode_c is used by Utility.Gpg.secretKeys, because gpg
--with-colons hex-escapes some characters in particular ':' and '\\'.
gpg passes unicode through, so this use of decode_c is not a problem.
This commit was sponsored by Henrik Riomar on Patreon.
Where before the "name" of a key and a backend was a string, this makes
it a concrete data type.
This is groundwork for allowing some varieties of keys to be disabled
in file2key, so git-annex won't use them at all.
Benchmarks ran in my big repo:
old git-annex info:
real 0m3.338s
user 0m3.124s
sys 0m0.244s
new git-annex info:
real 0m3.216s
user 0m3.024s
sys 0m0.220s
new git-annex find:
real 0m7.138s
user 0m6.924s
sys 0m0.252s
old git-annex find:
real 0m7.433s
user 0m7.240s
sys 0m0.232s
Surprising result; I'd have expected it to be slower since it now parses
all the key varieties. But, the parser is very simple and perhaps
sharing KeyVarieties uses less memory or something like that.
This commit was supported by the NSF-funded DataLad project.
Before, only content known to be present somewhere was considered a
duplicate. Now, any content that has been annexed before will be considered
a duplicate, even if all annexed copies of the data have been lost.
Note that --clean-duplicates and --deduplicate still check numcopies,
so won't delete duplicate files unless there's an annexed copy.
This makes import use the same method as reinject --known.
The man page already said that duplicate meant "its content is either
present in the local repository already, or git-annex knows of another
repository that contains it, or it was present in the annex before but has
been removed now". So, this is really only bringing the implementation into
line with the man page.
This commit was sponsored by Jochen Bartl on Patreon.
In the case where the pointer file is in place, and not the content
of the object, lock's performNew was called with filemodified=True,
which caused it to try to repopulate the object from an unmodified
associated file, of which there were none. So, the content of the object
got thrown away incorrectly. This was the cause (although not the root
cause) of data loss in https://github.com/datalad/datalad/issues/1020
The same problem could also occur when the work tree file is modified,
but the object is not, and lock is called with --force. Added a test case
for this, since it's excercising the same code path and is easier to set up
than the problem above.
Note that this only occurred when the keys database did not have an inode
cache recorded for the annex object. Normally, the annex object would be in
there, but there are of course circumstances where the inode cache is out
of sync with reality, since it's only a cache.
Fixed by checking if the object is unmodified; if so we don't need to
try to repopulate it. This does add an additional checksum to the unlock
path, but it's already checksumming the worktree file in another case,
so it doesn't slow it down overall.
Further investigation found a similar problem occurred when smudge --clean
is called on a file and the inode cache is not populated. cleanOldKeys
deleted the unmodified old object file in this case. This was also
fixed by checking if the object is unmodified.
In general, use of getInodeCaches and sameInodeCache is potentially
dangerous if the inode cache has not gotten populated for some reason.
Better to use isUnmodified. I breifly auited other places that check the
inode cache, and did not see any immediate problems, but it would be easy
to miss this kind of problem.
When git-annex is used with a git version older than 2.2.0, disable support for
adjusted branches, since GIT_COMMON_DIR is needed to update them and was first
added in that version of git.
The crash turned out to be caused by the sqlite database being deleted out
from under sqlite before it was done with it. Since multiple git_annex
calls are done in the same process while running the test suite, the
DbHandle could linger until GCed, and the test repo, and thus sqlite
database be deleted before the workerThread was done.
So, we need to look at both the file on disk to see if it's a annex link,
and the file in the index too. lookupFile doesn't look in the index if the file
is not present on disk.
have to change the content of unlocked file before committing
otherwise git commit will fail in v6 mode when the file was already
unlocked, because no changes have been made
In v5, that was not possible, but it is in v6, and so the test was failing.
Investigating, it turns out that locking was copying the pointer file
content to the annex object despite the content not being present. So,
add a check to prevent that.