The encode' and decode' functions on Windows should not apply the
filesystem encoding, which does not work there. Instead, convert to and
from UTF-8.
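
A minimal sketch of converting to and from UTF-8 instead of going through
the filesystem encoding. The function names are hypothetical and the use of
the text package is an assumption made for illustration; this is not the
actual definition of encode' and decode'.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Encoding.Error as TE

-- Hypothetical helper: String to UTF-8 bytes, independent of the
-- filesystem encoding.
encodeUtf8String :: String -> B.ByteString
encodeUtf8String = TE.encodeUtf8 . T.pack

-- Hypothetical helper: UTF-8 bytes back to a String. Invalid byte
-- sequences are replaced with U+FFFD rather than throwing.
decodeUtf8String :: B.ByteString -> String
decodeUtf8String = T.unpack . TE.decodeUtf8With TE.lenientDecode
```

Valid Unicode strings round-trip through these two helpers regardless of the
locale or platform.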
Also, avoid exporting encodeW8 and decodeW8. Both use the filesystem
encoding, so they won't work as expected on Windows.
My ByteString rewrite oversimplified it, resulting in any _ in a journal
file turning into a / in the git-annex branch, which was often the wrong
filename, or sometimes (//) an invalid filename that git
refused to add.
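
To illustrate the kind of bug described above, here is a sketch under an
assumed escaping scheme: a path separator is stored as a single _ and a
literal _ is stored as __. The scheme and all function names are assumptions
for illustration, not the actual journal code.

```haskell
-- Assumed scheme: '/' becomes "_", a literal '_' becomes "__".
escapeJournal :: String -> String
escapeJournal = concatMap esc
  where
    esc '/' = "_"
    esc '_' = "__"
    esc c   = [c]

-- Reverses escapeJournal, reading left to right: "__" is a literal
-- '_', a lone "_" is a path separator. (A '/' directly followed by
-- '_' is ambiguous under this scheme and does not round-trip.)
unescapeJournal :: String -> String
unescapeJournal ('_':'_':rest) = '_' : unescapeJournal rest
unescapeJournal ('_':rest)     = '/' : unescapeJournal rest
unescapeJournal (c:rest)       = c : unescapeJournal rest
unescapeJournal []             = []

-- The oversimplified version: every '_' turns into a '/'.
naiveUnescape :: String -> String
naiveUnescape = map (\c -> if c == '_' then '/' else c)
```

With these definitions, naiveUnescape "aaa_bbb_foo__bar.log" yields
"aaa/bbb/foo//bar.log", while unescapeJournal recovers
"aaa/bbb/foo_bar.log".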
git-annex find is now RawFilePath end to end, no string conversions.
So is git-annex get when it does not need to get anything.
So this is a major milestone on optimisation.
Benchmarks indicate around 30% speedup in both commands.
Probably many other performance improvements. All or nearly all places
where a file is statted use RawFilePath now.
Adds a dependency on filepath-bytestring, an as yet unreleased fork of
filepath that operates on RawFilePath.
Git.Repo also changed to use RawFilePath for the path to the repo.
This does eliminate some RawFilePath -> FilePath -> RawFilePath
conversions. And filepath-bytestring's </> is probably faster.
But I don't expect a major performance improvement from this.
This is mostly groundwork for making Annex.Location use RawFilePath,
which will allow for a conversion-free pipeline.
Only done on those calls to getFileStatus that had a RawFilePath, not a
FilePath. The others would probably be just as fast if converted to use
it with toRawFilePath, but I'm not 100% sure.
Note that genInodeCache' uses fromRawFilePath, but that value only gets
used on Windows, so on unix the thunk will never be evaluated.
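
The laziness argument can be shown with a small sketch. The function and its
parameters are hypothetical and only model the shape of the situation; they
are not genInodeCache' itself.

```haskell
-- Hypothetical function: the path argument is only inspected on the
-- Windows branch, so elsewhere its thunk is never forced.
describeInode :: Bool -> Int -> String -> String
describeInode onWindows inode path
  | onWindows = path ++ ":" ++ show inode
  | otherwise = show inode
```

Calling describeInode False 42 (error "never forced") returns "42" without
raising the error, because the third argument is never evaluated.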
File mode is octal not decimal. This broke in the conversion to
attoparsec.
(I've submitted the content of Utility.Attoparsec to the attoparsec
developers.)
Test suite passes 100% now.
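
A small sketch of the octal versus decimal distinction. It uses readOct from
base rather than attoparsec, and parseOctalMode is a hypothetical name, so
this only illustrates the point and is not the parser that was fixed.

```haskell
import Numeric (readOct)

-- Hypothetical helper: parse a git file mode such as "100644" as an
-- octal number. Returns Nothing on empty input, on non-octal digits,
-- or on trailing characters.
parseOctalMode :: String -> Maybe Int
parseOctalMode s = case readOct s of
  [(n, "")] -> Just n
  _         -> Nothing
```

parseOctalMode "100644" gives Just 33188, whereas reading the same digits as
decimal would give 100644, which is a different mode.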
Empty filenames were already filtered out as not allowed. But before
the change to ByteString, a NUL could appear in an Arbitrary String,
and so Arbitrary AssociatedFile sometimes generated illegal filenames,
as NUL never appears in a filename. The change to ByteString meant the
String was run through toRawFilePath, which assumes a filename never
contains a NUL. That truncated the String at the NUL, which could
result in an AssociatedFile being generated with an empty filename.
The filtering of NUL added here is not really necessary, because
of the truncation, but it makes explicit that NUL is not allowed.
The real fix is that the suchThat now applies to the final
AssociatedFile, so will catch any empty ones however generated.
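
Why the check has to apply to the final value can be sketched without
QuickCheck. Here truncateAtNul models the truncation described above; all
names are hypothetical and this is not the actual Arbitrary instance.

```haskell
-- Models the truncation at the first NUL described above.
truncateAtNul :: String -> String
truncateAtNul = takeWhile (/= '\NUL')

-- Emptiness checked on the input, before conversion: a string that
-- starts with NUL passes the check and then becomes empty.
checkedBefore :: String -> Maybe String
checkedBefore s
  | null s    = Nothing
  | otherwise = Just (truncateAtNul s)

-- Emptiness checked on the final value: catches empty filenames
-- however they were produced.
checkedAfter :: String -> Maybe String
checkedAfter s =
  let f = truncateAtNul s
  in if null f then Nothing else Just f
```

For the input "\NULfoo", checkedBefore returns Just "" while checkedAfter
returns Nothing.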
This raises the more general question of whether toRawFilePath might
truncate other strings that later get used as filenames. I think new
bugs probably won't be introduced by that. Before, a FilePath that got
read from somewhere (eg an attacker) and contained a NUL would perhaps
be printed out by git-annex, including the NUL, or written to disk
inside a file, or what have you. But as soon as that FilePath gets
passed to any IO action that treats it as a filename, it gets truncated
after the NUL. Eg, writeFile "foo\NULbar" "bar" writes to file "foo".
Now toRawFilePath will make the truncation happen earlier, but at most
this will affect what gets printed out or is written to disk inside a
file; actually using the RawFilePath as a filename will not change from
using the FilePath as a filename.
I had thought using ByteString would avoid the problem, but the
quickcheck property is still taking Arbitrary String input, so the use
of ByteString internally doesn't matter.
This was already optimised before, but profiling found that delEntry was
around 1.5% of the total runtime of git-annex whereis. It was being
called once per environment variable per file processed.
Fixed by better caching. Since withIndexFile is almost always run with
the same .git/annex/index file, it can cache the modified environment,
rather than re-modifying it each time called.
(cherry picked from commit 6535aea49a)
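
The caching idea can be sketched as follows. This assumes the modified
environment differs from the base one only in git's GIT_INDEX_FILE variable;
the types and function names are hypothetical and this is not the actual
withIndexFile code.

```haskell
import Data.IORef

type Env = [(String, String)]

-- Hypothetical cache: the last index file seen, with the environment
-- that was built for it.
newtype IndexEnvCache = IndexEnvCache (IORef (Maybe (FilePath, Env)))

newIndexEnvCache :: IO IndexEnvCache
newIndexEnvCache = IndexEnvCache <$> newIORef Nothing

-- Drop any existing GIT_INDEX_FILE entry and add the new one.
setIndexFile :: FilePath -> Env -> Env
setIndexFile f env =
  ("GIT_INDEX_FILE", f) : filter ((/= "GIT_INDEX_FILE") . fst) env

-- Reuse the cached environment when the index file is unchanged;
-- otherwise rebuild it from the base environment and remember it.
cachedIndexEnv :: IndexEnvCache -> IO Env -> FilePath -> IO Env
cachedIndexEnv (IndexEnvCache ref) getBaseEnv f = do
  cached <- readIORef ref
  case cached of
    Just (f', env) | f' == f -> return env
    _ -> do
      env <- setIndexFile f <$> getBaseEnv
      writeIORef ref (Just (f, env))
      return env
```

Since the index file is almost always the same, the environment is modified
once in the common case and then reused for each file processed.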