git-annex/doc/todo/optimize_by_converting_String_to_ByteString.mdwn

git-annex uses FilePath (String) extensively. That's a slow data type.
Converting to ByteString, and RawFilePath, should speed it up
significantly, according to [[/profiling]].

I've made a test branch, `bs`, to see what kind of performance improvement
to expect.

Benchmarking `git-annex find`, speedups range from 28-66%. The files fly by
much more snappily. Other commands likely also speed up, but do more work
than find so the improvement is not as large.

The `bs` branch is in a mergeable state now, but still needs work:

* Eliminate all the fromRawFilePath, toRawFilePath, encodeBS,
  decodeBS conversions. Or at least most of them. There are likely
  quite a few places where a value is converted back and forth several times.

  As a first step, profile and look for the hot spots. Known hot spots:

  * keyFile uses fromRawFilePath and that adds around 3% overhead in `git-annex find`.
    Converting it to a RawFilePath needs a version of `</>` for RawFilePaths.
  * getJournalFileStale uses fromRawFilePath, and adds 3-5% overhead in
    `git-annex whereis`. Converting it to RawFilePath needs a version
    of `</>` for RawFilePaths. It also needs a ByteString.readFile
    for RawFilePath.
* System.FilePath is not available for RawFilePath, and many of the
  conversions are to get a FilePath in order to use that library.
  It should be entirely straightforward to make a version of System.FilePath
  that can operate on RawFilePath, except possibly there could be some
  complications due to Windows.
* Use versions of IO actions like getFileStatus that take a RawFilePath,
  avoiding a conversion. Note that these are only available on Unix, not
  Windows, so a compatibility shim will be needed.
  (I can't seem to find any library that provides one.)
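
To make the `</>` item more concrete, here is a rough sketch of what a RawFilePath version of `</>` (plus one more System.FilePath function) might look like. It is not code from the `bs` branch. It only handles POSIX paths, so the Windows complications mentioned above (drive letters, backslashes) are ignored entirely. The local `RawFilePath` alias, `pathSeparator`, `takeFileName` and `exampleJournalDir` are names chosen for illustration; in real code the alias would come from the unix package rather than being redefined.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import Data.Word (Word8)

-- Assumption: mirrors the RawFilePath alias from the unix package.
-- Redefined here only so the sketch depends on bytestring alone.
type RawFilePath = ByteString

-- The POSIX path separator '/' as a byte.
pathSeparator :: Word8
pathSeparator = 47

-- Same fixity that System.FilePath gives its (</>).
infixr 5 </>

-- Hypothetical RawFilePath counterpart of (</>), POSIX only.
-- An absolute second argument replaces the first, and no
-- separator is added when the first already ends in one.
(</>) :: RawFilePath -> RawFilePath -> RawFilePath
a </> b
  | B.null a = b
  | B.null b = a
  | B.head b == pathSeparator = b
  | B.last a == pathSeparator = B.append a b
  | otherwise = B.concat [a, B.singleton pathSeparator, b]

-- Hypothetical counterpart of takeFileName: everything after
-- the last separator, or the whole path if there is none.
takeFileName :: RawFilePath -> RawFilePath
takeFileName = snd . B.breakEnd (== pathSeparator)

-- Illustrative use; the path itself is made up for the example.
exampleJournalDir :: RawFilePath
exampleJournalDir = ".git/annex" </> "journal"
```

Since everything stays a ByteString, a function like keyFile could build its path without a round trip through String. Whether this matches System.FilePath closely enough in all the corner cases would need checking against that library's test suite.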