Merge branch 'master' into bs

2019-11-27 13:47:03 -04:00 · 2019-11-27 13:47:03 -04:00 · c914058bf9
commit c914058bf9
parent 067aabdd48 a2b566be29
6 changed files with 124 additions and 0 deletions
--- a/doc/devblog/day_610-611__ByteString_optimisation_early_days.mdwn
+++ b/doc/devblog/day_610-611__ByteString_optimisation_early_days.mdwn
@@ -0,0 +1,12 @@
+Two entire days spent making a branch where git-annex uses ByteString
+instead of String, especially for filepaths. I commented out all the
+commands except for find, but it still took thousands of lines of patches
+to get it to compile.
+
+The result: git-annex find is between 28% and 66% faster when using
+ByteString. The files just fly by!
+
+It's going to be a long, long road to finish this, but it's good to have a
+start, and know it will be worth it.
+[[todo/optimize_by_converting_String_to_ByteString]] is the tracking page
+for this going forward.
--- a/doc/git-annex-add/comment_7_46fe1a6d38eecb246f5b5659bc0c00c8._comment
+++ b/doc/git-annex-add/comment_7_46fe1a6d38eecb246f5b5659bc0c00c8._comment
@@ -0,0 +1,9 @@
+[[!comment format=mdwn
+ username="linnearight02@915958f850452a19de84ec14a765402d1f7ecdb0"
+ nickname="linnearight02"
+ avatar="http://cdn.libravatar.org/avatar/9c146ceff6ab204aa75ec5a686bd6cfb"
+ subject="Online Coursework Service"
+ date="2019-11-26T11:11:07Z"
+ content="""
+Get the best [online coursework service](https://www.allassignmenthelp.com/online-coursework-service.html) by the top Aussie writers at cheap rates. We at [AllAssignmentHelp](https://www.allassignmenthelp.com/) known to provide custom coursework services and unlimited support to the Australian students when they place an order with us. All of our writers are well-qualified and trained professional writers, thus no need to be worried about the quality of the delivered work.
+"""]]
--- a/doc/todo/OPT58_34bundle34_get_+_check_40of_checksum41_in_a_single_operation.mdwn
+++ b/doc/todo/OPT58_34bundle34_get_+_check_40of_checksum41_in_a_single_operation.mdwn
@@ -0,0 +1,14 @@
+In neurophysiology we encounter HUGE files (HDF5 .nwb files).
+Sizes reach hundreds of GBs per file (thus exceeding any possible file system memory cache size).  While operating in the cloud or on a fast connection it is possible to fetch the files with speeds up to 100 MBps. 
+After a successful download, git-annex reads such files back for checksum validation, often at slower speeds (e.g. <60 MBps on an EC2 SSD drive).
+So, ironically, this does not just double but nearly triples the overall time to obtain a file.
+
+Ideally, I think:
+
+- (at minimum) for built-in special remotes (such as web), git-annex would checksum incrementally as the data comes in;
+- external special remotes could provide the desired checksum of the obtained content. git-annex should of course first tell them which type (backend) of checksum it is interested in, and maybe external remotes could report which checksums they support.
+
+If an example is needed, http://datasets.datalad.org/allen-brain-observatory/visual-coding-2p/.git has >50GB files such as ophys_movies/ophys_experiment_576261945.h5 .
+
+[[!meta author=yoh]]
+[[!tag projects/dandi]]
--- a/doc/todo/OPT58_34bundle34_get_+_check_40of_checksum41_in_a_single_operation/comment_1_29e601ea3ea4f22301c6cf6eed403ba4._comment
+++ b/doc/todo/OPT58_34bundle34_get_+_check_40of_checksum41_in_a_single_operation/comment_1_29e601ea3ea4f22301c6cf6eed403ba4._comment
@@ -0,0 +1,11 @@
+[[!comment format=mdwn
+ username="Ilya_Shlyakhter"
+ avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
+ subject="use named pipes?"
+ date="2019-11-25T16:45:26Z"
+ content="""
+For external remotes, one can pass a named pipe as the `FILE` parameter of the `TRANSFER` request, and use `tee` to create a separate stream for checksumming.
+
+An external remote could also do its own checksum checking and then set `remote.<name>.annex-verify=false`.
+One could also make a “wrapper” external remote that delegates all requests to a given external remote but does checksum-checking in parallel with downloading (by creating a named pipe and passing that to the wrapped remote).
+"""]]
--- a/doc/todo/optimize_by_converting_String_to_ByteString.mdwn
+++ b/doc/todo/optimize_by_converting_String_to_ByteString.mdwn
@@ -0,0 +1,34 @@
+git-annex uses FilePath (String) extensively. That's a slow data type.
+Converting to ByteString, and RawFilePath, should speed it up
+significantly, according to [[/profiling]].
+
+I've made a test branch, `bs`, to see what kind of performance improvement
+to expect. Most commands don't build yet in that branch, but `git annex
+find` does. Speedups range from 28% to 66%. The files fly by much more
+snappily.
+
+As well as adding back all the code that was disabled to get it to build,
+the `bs` branch has quite a lot of things still needing work, including:
+
+* Eliminate all the fromRawFilePath, toRawFilePath, encodeBS,
+  decodeBS conversions. Or at least most of them. There are likely
+  quite a few places where a value is converted back and forth several times.
+
+  It would be good to instrument them with Debug.Trace and find out which
+  are the hot ones that get called, and focus on those.
+
+* System.FilePath is not available for RawFilePath, and many of the
+  conversions are to get a FilePath in order to use that library.
+
+  It should be entirely straightforward to make a version of System.FilePath
+  that can operate on RawFilePath, except possibly there could be some
+  complications due to Windows.
+
+* Use versions of IO actions like getFileStatus that take a RawFilePath,
+  avoiding a conversion. Note that these are only available on Unix, not
+  Windows, so a compatibility shim will be needed.
+  (I can't seem to find any library that provides one.)
+
+* Eliminate some uses of Data.ByteString.Lazy.toStrict, which is a slow copy.
+
+* Use ByteString for parsing git config to speed up startup.
--- a/doc/todo/optimize_by_converting_String_to_ByteString/comment_1_403601fa8ad6946eca8f598bdc31f2d7._comment
+++ b/doc/todo/optimize_by_converting_String_to_ByteString/comment_1_403601fa8ad6946eca8f598bdc31f2d7._comment
@@ -0,0 +1,44 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""profiling"""
+ date="2019-11-26T20:05:28Z"
+ content="""
+Profiling the early version of the `bs` branch. 
+
+		Tue Nov 26 16:05 2019 Time and Allocation Profiling Report  (Final)
+	
+		   git-annex +RTS -p -RTS find
+	
+		total time  =        2.75 secs   (2749 ticks @ 1000 us, 1 processor)
+		total alloc = 1,642,607,120 bytes  (excludes profiling overheads)
+	
+	COST CENTRE                      MODULE                         SRC                                                 %time %alloc
+	
+	inAnnex'.\                       Annex.Content                  Annex/Content.hs:(103,61)-(118,31)                   31.2   46.8
+	keyFile'                         Annex.Locations                Annex/Locations.hs:(567,1)-(577,30)                   5.3    6.2
+	encodeW8                         Utility.FileSystemEncoding     Utility/FileSystemEncoding.hs:(189,1)-(191,70)        3.3    4.2
+	>>=.\                            Data.Attoparsec.Internal.Types Data/Attoparsec/Internal/Types.hs:(146,9)-(147,44)    2.9    0.8
+	>>=.\.succ'                      Data.Attoparsec.Internal.Types Data/Attoparsec/Internal/Types.hs:146:13-76           2.6    0.3
+	keyFile'.esc                     Annex.Locations                Annex/Locations.hs:(573,9)-(577,30)                   2.5    5.5
+	parseLinkTarget                  Annex.Link                     Annex/Link.hs:(254,1)-(262,25)                        2.4    4.4
+	getAnnexLinkTarget'.probesymlink Annex.Link                     Annex/Link.hs:78:9-46                                 2.4    2.8
+	w82s                             Utility.FileSystemEncoding     Utility/FileSystemEncoding.hs:217:1-15                2.3    6.0
+	keyPath                          Annex.Locations                Annex/Locations.hs:(606,1)-(608,23)                   1.9    4.0
+	parseKeyVariety                  Types.Key                      Types/Key.hs:(323,1)-(371,42)                         1.8    0.0
+	getState                         Annex                          Annex.hs:(251,1)-(254,27)                             1.7    0.4
+	fileKey'.go                      Annex.Locations                Annex/Locations.hs:588:9-55                           1.4    0.8
+	fileKey'                         Annex.Locations                Annex/Locations.hs:(586,1)-(596,41)                   1.4    1.7
+	hashUpdates.\.\.\                Crypto.Hash                    Crypto/Hash.hs:85:48-99                               1.3    0.0
+	parseLinkTargetOrPointer         Annex.Link                     Annex/Link.hs:(239,1)-(243,25)                        1.2    0.1
+	withPtr                          Basement.Block.Base            Basement/Block/Base.hs:(395,1)-(404,31)               1.2    0.6
+	primitive                        Basement.Monad                 Basement/Monad.hs:72:5-18                             1.0    0.1
+	decodeBS'                        Utility.FileSystemEncoding     Utility/FileSystemEncoding.hs:151:1-31                1.0    2.8
+	mkKeySerialization               Types.Key                      Types/Key.hs:(115,1)-(117,22)                         0.7    1.1
+	w82c                             Utility.FileSystemEncoding     Utility/FileSystemEncoding.hs:211:1-28                0.6    1.1
+
+Comparing with [[/profiling]] results, the alloc is down significantly.
+And the main IO actions are getting a larger share of the runtime.
+
+There is still significant conversion going on, encodeW8 and w82s and
+decodeBS' and w82c. Likely another 5% or so speedup if that's eliminated.
+"""]]