This will speed up the common case where a Key is deserialized from
disk, but is then serialized to build eg, the path to the annex object.
It means that every place a Key has any of its fields changed, the cache
has to be dropped. I've grepped and found them all. But, it would be
better to avoid that gotcha somehow..
Now there's a ByteString used all the way from disk to Key.
The main complication in this conversion was the use of fromInternalGitPath
in several places to munge things on Windows. The things that used that
were changed to parse the ByteString using either path separator.
Also some code that had read from files to a String lazily was changed
to read a minimal strict ByteString.
A keyName could contain "/", though this is unlikely and certianly only
ever could happen with WORM keys.
The change to addunused to escape that is no problem at all.
The change to VariantFile to escape it means that different versions of
git-annex could resolve a merge conflict differently in this case, which
is unfortunate. There would be different .variant files used, so the two
resolutions would themselves merge together without additional
conflicts, but the user would have to clean up the extra .variant
files.
What these generate is not really suitable to be used as a filename,
which is why keyFile and fileKey further escape it. These are just
serializing Keys.
Also removed a quickcheck test that was very unlikely to test anything
useful, since it relied on random chance creating something that looks
like a serialized key. The other test is sufficient for testing what
that was intended to test anyway.
The new parser is significantly stricter than the old one:
The old file2key allowed the fields to come in any order,
but the new one requires the fixed order that git-annex has always used.
Hopefully this will not cause any breakage.
And the old file2key allowed eg SHA1-m1-m2-m3-m4-m5-m6--xxxx
while the new does not allow duplication of fields. This could potentially
improve security, because allowing lots of extra junk like that in a key
could potentially be used in a SHA1 collision attack, although the current
attacks need binary data and not this kind of structured numeric data.
Speed improved of course, and fairly substantially, in microbenchmarks:
benchmarking old/key2file
time 2.264 μs (2.257 μs .. 2.273 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 2.265 μs (2.260 μs .. 2.275 μs)
std dev 21.17 ns (13.06 ns .. 39.26 ns)
benchmarking new/key2file'
time 1.744 μs (1.741 μs .. 1.747 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 1.745 μs (1.742 μs .. 1.751 μs)
std dev 13.55 ns (9.099 ns .. 21.89 ns)
benchmarking old/file2key
time 6.114 μs (6.102 μs .. 6.129 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 6.118 μs (6.106 μs .. 6.143 μs)
std dev 55.00 ns (30.08 ns .. 100.2 ns)
benchmarking new/file2key'
time 1.791 μs (1.782 μs .. 1.801 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 1.792 μs (1.785 μs .. 1.804 μs)
std dev 32.46 ns (20.59 ns .. 50.82 ns)
variance introduced by outliers: 19% (moderately inflated)
Not likely to be any speed gain here, but this completes porting every
log file over.
And, it let me get rid of code copied from ghc and modified, so
simplifying the licensing.
This preserves the workaround for the old bug that caused NoUUID items
to be stored in the log, prefixing log lines with " ". It's now handled
implicitly, by using takeWhile1 (/= ' ') to get the uuid.
There is a behavior change from the old parser, which split the value
into words and then recombined it. That meant that "foo bar" and "foo\tbar"
came out as "foo bar". That behavior was not documented, and seems
surprising; it meant that after a git-annex describe here "foo bar",
you wouldn't get that same string back out when git-annex displayed repo
descriptions.
Otoh, some other parsers relied on the old behavior, and the attoparsec
rewrites had to deal with the issue themselves...
For group.log, there are some edge cases around the user providing a
group name with a leading or trailing space. The old parser would ignore
such excess whitespace. The new parser does too, because the alternative
is to refuse to parse something like " group1 group2 " due to excess
whitespace, which would be even more confusing behavior.
The only git-annex branch log file that is not converted to attoparsec
and bytestring-builder now is transitions.log.
There should be some speed gains here, especially for chunk and remote
state logs, which are queried once per key.
Now only old-format uuid-based logs still need to be converted to attoparsec.
Mostly didn't push the ByteStrings down very deep, but all of these log
files are not written to frequently at all, so slight remaining
innefficiency doesn't matter.
In Logs.UUID, removed the fixBadUUID code that cleaned up after a bug in
git-annex versions 3.20111105-3.20111110. In the unlikely event that a repo was
last touched by that ancient git-annex version, the descriptions of remotes
would appear missing when used with this version of git-annex. That is such minor
breakage, and so unlikely to still be a problem for any repos, that it was not
worth forward-porting that code to ByteString.
Probably not any particular speedup in this, since most of these logs
are not written to often. Possibly chunk log writing is sped up, but
writes to chunk logs are interleaved with expensive data transfers to
remotes, so unlikely to be a noticiable speedup.