close bug; copy benchmarking info to new page

This commit is contained in:
Joey Hess 2017-10-31 13:13:40 -04:00
parent d4429bcf00
commit fc8f1a7a1a
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
6 changed files with 204 additions and 0 deletions

doc/benchmarking.mdwn

@@ -0,0 +1,3 @@
This page exists to collect benchmarking data about git-annex, so it can
be referred to later. If you have a specific instance where git-annex seems
unnecessarily slow, please file a bug report about it.


@@ -0,0 +1,61 @@
[[!comment format=mdwn
username="joey"
subject="""comment 10"""
date="2016-09-29T18:33:33Z"
content="""
* Optimised key2file and file2key. 18% scanning time speedup.
* Optimised adjustGitEnv. 50% git-annex branch query speedup.
* Optimised parsePOSIXTime. 10% git-annex branch query speedup.
* Tried making catObjectDetails.receive use ByteString for parsing,
  but that did not seem to speed it up significantly.
  So its parsing is already fairly optimal; it's just that a
  lot of data passes through it when querying the git-annex
  branch.

After all that, profiling `git-annex find`:

    Thu Sep 29 16:51 2016 Time and Allocation Profiling Report  (Final)

       git-annex.1 +RTS -p -RTS find

    total time  =        1.73 secs   (1730 ticks @ 1000 us, 1 processor)
    total alloc = 1,812,406,632 bytes  (excludes profiling overheads)

    COST CENTRE             MODULE                  %time %alloc

    md5                     Data.Hash.MD5            28.0   37.9
    catchIO                 Utility.Exception        10.2   12.5
    inAnnex'.checkindirect  Annex.Content             9.9    3.7
    catches                 Control.Monad.Catch       8.7    5.7
    readish                 Utility.PartialPrelude    5.7    3.0
    isAnnexLink             Annex.Link                5.0    8.4
    keyFile                 Annex.Locations           4.2    5.8
    spanList                Data.List.Utils           4.0    6.3
    startswith              Data.List.Utils           2.0    1.3

And `git-annex find --not --in web`:

    Thu Sep 29 16:35 2016 Time and Allocation Profiling Report  (Final)

       git-annex +RTS -p -RTS find --not --in web

    total time  =        5.24 secs   (5238 ticks @ 1000 us, 1 processor)
    total alloc = 3,293,314,472 bytes  (excludes profiling overheads)

    COST CENTRE                MODULE                  %time %alloc

    catObjectDetails.receive   Git.CatFile              12.9    5.5
    md5                        Data.Hash.MD5            10.6   20.8
    readish                    Utility.PartialPrelude    7.3    8.2
    catchIO                    Utility.Exception         6.7    7.3
    spanList                   Data.List.Utils           4.1    7.4
    readFileStrictAnyEncoding  Utility.Misc              3.5    1.3
    catches                    Control.Monad.Catch       3.3    3.2

So, quite a large speedup overall!
This leaves md5 still unoptimised at 10-28% of CPU use. I looked at switching
it to cryptohash's implementation, but it would require quite a lot of
bit-banging math to pull the used values out of the ByteString containing
the md5sum.
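
For illustration, here is a minimal sketch of the kind of bit-banging involved, assuming the digest is available as a 16-byte ByteString and that the four MD5 state words are wanted in little-endian order; `digestWords` is a hypothetical name, not existing git-annex code:

    import qualified Data.ByteString as B
    import Data.Bits (shiftL, (.|.))
    import Data.Word (Word32)

    -- Pull four little-endian Word32 values out of a 16-byte MD5 digest
    -- ByteString, roughly what the hash directory calculation would need.
    digestWords :: B.ByteString -> [Word32]
    digestWords d = map word32At [0, 4, 8, 12]
      where
        word32At o = b o 0 .|. (b o 1 `shiftL` 8) .|. (b o 2 `shiftL` 16) .|. (b o 3 `shiftL` 24)
        b o i = fromIntegral (B.index d (o + i)) :: Word32
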
"""]]


@@ -0,0 +1,49 @@
[[!comment format=mdwn
username="joey"
subject="""comment 11"""
date="2017-05-15T21:56:52Z"
content="""
Switched from MissingH to cryptonite for md5. It did move md5 out of the top CPU spot but
the overall runtime didn't change much. Memory allocations did go down by a
good amount.
Updated profiles:

       git-annex +RTS -p -RTS find

    total time  =        1.63 secs   (1629 ticks @ 1000 us, 1 processor)
    total alloc = 1,496,336,496 bytes  (excludes profiling overheads)

    COST CENTRE             MODULE                   SRC                                            %time %alloc

    catchIO                 Utility.Exception        Utility/Exception.hs:79:1-17                    14.1   15.1
    inAnnex'.checkindirect  Annex.Content            Annex/Content.hs:(108,9)-(119,39)               10.6    4.8
    catches                 Control.Monad.Catch      src/Control/Monad/Catch.hs:(432,1)-(436,76)      8.6    6.9
    spanList                Data.List.Utils          src/Data/List/Utils.hs:(150,1)-(155,36)          6.7   11.1
    isAnnexLink             Annex.Link               Annex/Link.hs:35:1-85                            5.0   10.2
    keyFile                 Annex.Locations          Annex/Locations.hs:(456,1)-(462,19)              5.0    7.0
    readish                 Utility.PartialPrelude   Utility/PartialPrelude.hs:(48,1)-(50,20)         3.8    2.0
    startswith              Data.List.Utils          src/Data/List/Utils.hs:103:1-23                  3.6    2.3
    splitc                  Utility.Misc             Utility/Misc.hs:(52,1)-(54,25)                   3.4    6.5
    s2w8                    Data.Bits.Utils          src/Data/Bits/Utils.hs:65:1-15                   2.6    6.4
    keyPath                 Annex.Locations          Annex/Locations.hs:(492,1)-(494,23)              2.5    4.4
    fileKey.unesc           Annex.Locations          Annex/Locations.hs:(469,9)-(474,39)              2.0    3.5
    copyAndFreeze           Data.ByteArray.Methods   Data/ByteArray/Methods.hs:(224,1)-(227,21)       1.8    0.5

       git-annex +RTS -p -RTS find --not --in web

    total time  =        5.33 secs   (5327 ticks @ 1000 us, 1 processor)
    total alloc = 2,908,489,000 bytes  (excludes profiling overheads)

    COST CENTRE          MODULE                       SRC                                              %time %alloc

    catObjectDetails.\   Git.CatFile                  Git/CatFile.hs:(80,72)-(88,97)                     7.8    2.8
    catchIO              Utility.Exception            Utility/Exception.hs:79:1-17                       7.6    8.3
    spanList             Data.List.Utils              src/Data/List/Utils.hs:(150,1)-(155,36)            5.8    9.1
    readish              Utility.PartialPrelude       Utility/PartialPrelude.hs:(48,1)-(50,20)           4.5    4.0
    parseResp            Git.CatFile                  Git/CatFile.hs:(113,1)-(124,28)                    4.4    2.9
    readFileStrict       Utility.Misc                 Utility/Misc.hs:33:1-59                            3.7    1.6
    catches              Control.Monad.Catch          src/Control/Monad/Catch.hs:(432,1)-(436,76)        3.1    3.6
    encodeW8             Utility.FileSystemEncoding   Utility/FileSystemEncoding.hs:(131,1)-(133,70)     3.1    2.3

"""]]


@@ -0,0 +1,45 @@
[[!comment format=mdwn
username="joey"
subject="""profiling"""
date="2016-09-26T19:20:36Z"
content="""
Built git-annex with profiling, using `stack build --profile`.

(For reproducibility, running git-annex in a clone of the git-annex repo
https://github.com/RichiH/conference_proceedings with rev
2797a49023fc24aff6fcaec55421572e1eddcfa2 checked out. It has 9496 annexed
objects.)
Profiling `git-annex find +RTS -p`:

    total time  =        3.53 secs   (3530 ticks @ 1000 us, 1 processor)
    total alloc = 3,772,700,720 bytes  (excludes profiling overheads)

    COST CENTRE             MODULE                  %time %alloc

    spanList                Data.List.Utils          32.6   37.7
    startswith              Data.List.Utils          14.3    8.1
    md5                     Data.Hash.MD5            12.4   18.2
    join                    Data.List.Utils           6.9   13.7
    catchIO                 Utility.Exception         5.9    6.0
    catches                 Control.Monad.Catch       5.0    2.8
    inAnnex'.checkindirect  Annex.Content             4.6    1.8
    readish                 Utility.PartialPrelude    3.0    1.4
    isAnnexLink             Annex.Link                2.6    4.0
    split                   Data.List.Utils           1.5    0.8
    keyPath                 Annex.Locations           1.2    1.7

This is interesting!

Fully 40% of CPU time and allocations are in list (really String) processing,
and the details of the profiling report show that `spanList`, `startswith`,
and `join` are all coming from calls to `replace` in `keyFile` and `fileKey`.

Both functions nest several calls to `replace`, so perhaps that could be unwound
into a single pass, and/or a ByteString could be used to do it more efficiently.
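
As a rough sketch of the single-pass idea, with an illustrative escape set
(not necessarily git-annex's actual one) and a hypothetical name:

    -- One traversal of the String instead of several nested 'replace' passes.
    -- Illustrative escape set; the real keyFile escapes may differ.
    escapeKeyName :: String -> String
    escapeKeyName = concatMap esc
      where
        esc '&' = "&a"
        esc '%' = "&s"
        esc ':' = "&c"
        esc '/' = "%"
        esc c   = [c]
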
12% of run time is spent calculating the md5 hashes for the hash
directories under .git/annex/objects. Data.Hash.MD5 is from MissingH, and
is probably a quite unoptimised implementation. Switching to the version
in cryptonite would probably speed it up a lot.
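
For reference, a minimal sketch of computing md5 via cryptonite (`md5Hex` is a
hypothetical helper, not git-annex's actual hash-directory code):

    import Crypto.Hash (Digest, MD5, hash)
    import qualified Data.ByteString.Char8 as B8

    -- MD5 via cryptonite; 'show' on a Digest yields its lowercase hex string.
    md5Hex :: String -> String
    md5Hex s = show (hash (B8.pack s) :: Digest MD5)
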
"""]]


@@ -0,0 +1,42 @@
[[!comment format=mdwn
username="joey"
subject="""more profiling"""
date="2016-09-26T19:59:43Z"
content="""
Instead of profiling `git annex copy --to remote`, I profiled `git annex
find --not --in web`, which needs to do the same kind of location log lookup.

    total time  =       12.41 secs   (12413 ticks @ 1000 us, 1 processor)
    total alloc = 8,645,057,104 bytes  (excludes profiling overheads)

    COST CENTRE                MODULE                  %time %alloc

    adjustGitEnv               Git.Env                  21.4   37.0
    catchIO                    Utility.Exception        13.2    2.8
    spanList                   Data.List.Utils          12.6   17.9
    parsePOSIXTime             Logs.TimeStamp            6.1    5.0
    catObjectDetails.receive   Git.CatFile               5.9    2.1
    startswith                 Data.List.Utils           5.7    3.8
    md5                        Data.Hash.MD5             5.1    7.9
    join                       Data.List.Utils           2.4    6.0
    readFileStrictAnyEncoding  Utility.Misc              2.2    0.5

The adjustGitEnv overhead is a surprise! It seems it is getting called once
per file, and allocating a new copy of the environment each time. Call stack:
withIndex calls withIndexFile calls addGitEnv calls adjustGitEnv.
Looks like simply making gitEnv be cached at startup would avoid most of
the adjustGitEnv slowdown.
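
A minimal sketch of that caching idea, with hypothetical names (the real change
would live in git-annex's state handling, not a standalone module):

    import System.Environment (getEnvironment)

    -- Read the process environment once at startup and reuse it, rather than
    -- calling getEnvironment (and copying the whole environment) per file.
    newtype EnvCache = EnvCache [(String, String)]

    newEnvCache :: IO EnvCache
    newEnvCache = EnvCache <$> getEnvironment

    -- What adjustGitEnv would then boil down to: prepend the per-command
    -- overrides to the cached base environment instead of re-reading it.
    withOverrides :: EnvCache -> [(String, String)] -> [(String, String)]
    withOverrides (EnvCache base) overrides = overrides ++ base
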
(The catchIO overhead is a false reading; the detailed profile shows
that all its time and allocations are inherited. getAnnexLinkTarget
is running catchIO in the expensive case, so readSymbolicLink is
the actual expensive bit.)
The parsePOSIXTime comes from reading location logs. It's implemented
using a generic Data.Time.Format.parseTime, which uses a format string
"%s%Qs". A custom parser that splits into seconds and picoseconds
and simply reads both numbers might be more efficient.
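
A sketch of such a parser, assuming non-negative timestamps with an optional
fractional part; `parsePOSIXTime'` is a hypothetical name, not the parser
git-annex actually uses:

    import Data.Time.Clock.POSIX (POSIXTime)
    import Data.Ratio ((%))
    import Text.Read (readMaybe)

    -- Split "1475174013.123" into whole seconds and a fractional part and
    -- read both numbers directly, avoiding the "%s%Qs" parseTime machinery.
    parsePOSIXTime' :: String -> Maybe POSIXTime
    parsePOSIXTime' s = case break (== '.') s of
        (secs, "")       -> fromInteger <$> readMaybe secs
        (secs, _ : frac) -> do
            n <- readMaybe secs
            f <- readMaybe frac
            let denom = 10 ^ length frac :: Integer
            return (fromRational (fromInteger n + f % denom))
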
catObjectDetails.receive is implemented using mostly String and could
probably be sped up by being converted to use ByteString.
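
For a rough idea of the ByteString flavour, here is a sketch of parsing the
"<sha> <type> <size>" header line that `git cat-file --batch` emits, using
Data.ByteString.Char8; `parseHeader` is a hypothetical helper, not git-annex's
actual code:

    import qualified Data.ByteString.Char8 as B8

    -- Parse one "<sha> <type> <size>" header line from git cat-file --batch.
    parseHeader :: B8.ByteString -> Maybe (B8.ByteString, B8.ByteString, Integer)
    parseHeader line = case B8.words line of
        [sha, objtype, sizestr] -> case B8.readInteger sizestr of
            Just (size, rest) | B8.null rest -> Just (sha, objtype, size)
            _ -> Nothing
        _ -> Nothing
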
"""]]


@@ -1,3 +1,7 @@
I was trying to copy files which had failed to copy (3 out of 6,000) to a remote host after copy -J4. Succeeded. But with subsequent runs, apparently even with copy --fast, it takes annex 10 sec to realize there is nothing to copy. git ls-files, which annex calls, returns its list immediately, so it is really some parsing of or access to data under the git-annex branch which takes a while. I think we had a similar discussion before, but I couldn't find it. So I wondered whether to whine again, to see if some optimization is possible to make --fast copies faster, especially when there is nothing to copy.
[[!meta author=yoh]]
> closing as yoh is happy. [[done]]. Note that I copied the benchmarking-related
> comments to the [[/benchmarking]] page for future reference.
> --[[Joey]]