Merge branch 'master' into bs

Joey Hess 2019-11-27 13:47:03 -04:00
commit c914058bf9
GPG key ID: DB12DB0FF05F8F38
6 changed files with 124 additions and 0 deletions

@@ -0,0 +1,12 @@
Two entire days spent making a branch where git-annex uses ByteString
instead of String, especially for filepaths. I commented out all the
commands except for find, but it still took thousands of lines of patches
to get it to compile.

The result: git-annex find is between 28% and 66% faster when using
ByteString. The files just fly by!

It's going to be a long, long road to finish this, but it's good to have a
start, and know it will be worth it.

[[todo/optimize_by_converting_String_to_ByteString]] is the tracking page
for this going forward.

@@ -0,0 +1,14 @@
In neurophysiology we encounter HUGE files (HDF5 .nwb files).
Sizes reach hundreds of GBs per file (thus exceeding any possible file
system memory cache size). While operating in the cloud or on a fast
connection, it is possible to fetch the files at speeds of up to 100 MBps.
Upon successful download, such files are then read back by git-annex for
checksum validation, often at slower speeds (e.g. <60 MBps on an EC2 SSD
drive). So, ironically, validation does not just double, but nearly
triples the overall time to obtain a file.

I think ideally:

- (at minimum) for built-in special remotes (such as web), it would be
  great if git-annex checksummed incrementally as the data comes in (see
  the sketch after this list);
- it could be made possible for external special remotes to provide the
  desired checksum for obtained content. git-annex should, of course,
  first inform them of the type (backend) of checksum it is interested
  in, and maybe external remotes could report which checksums they
  support.
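A minimal sketch of the incremental idea for the built-in case. This is
not git-annex code; `downloadWithChecksum` and the choice of SHA256 are
illustrative, and cryptonite's incremental hashing API is assumed:

    import Crypto.Hash (Context, Digest, SHA256, hashInit, hashUpdate, hashFinalize)
    import qualified Data.ByteString as B
    import System.IO

    -- Stream from a source handle (e.g. an HTTP response body) into the
    -- destination file, feeding each chunk to the hash context as it
    -- arrives, so the checksum is known the moment the download finishes.
    downloadWithChecksum :: Handle -> FilePath -> IO (Digest SHA256)
    downloadWithChecksum src dest = withFile dest WriteMode (go hashInit)
      where
        go ctx out = do
            chunk <- B.hGetSome src 65536
            if B.null chunk
                then return (hashFinalize ctx)
                else do
                    B.hPut out chunk
                    go (hashUpdate ctx chunk) out

That avoids the second read pass entirely: obtaining a file costs one
download plus an in-memory hash, rather than a download plus a full
re-read from disk.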
If an example is needed, here is http://datasets.datalad.org/allen-brain-observatory/visual-coding-2p/.git with >50GB files such as ophys_movies/ophys_experiment_576261945.h5 .
[[!meta author=yoh]]
[[!tag projects/dandi]]

@@ -0,0 +1,11 @@
[[!comment format=mdwn
username="Ilya_Shlyakhter"
avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
subject="use named pipes?"
date="2019-11-25T16:45:26Z"
content="""
For external remotes, git-annex could pass a named pipe as the `FILE` parameter of the `TRANSFER` request, and use `tee` to create a separate stream for checksumming.
An external remote could also do its own checksum checking and then set `remote.<name>.annex-verify=false`.
One could also make a "wrapper" external remote that delegates all requests to a given external remote but does checksum-checking in parallel with downloading (by creating a named pipe and passing that to the wrapped remote).
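A rough sketch of that named-pipe approach (illustrative only:
`teeAndChecksum` is not a real git-annex function, and the wrapped remote
would have to be asked, concurrently, to `TRANSFER` into the fifo):

    import Crypto.Hash (Digest, SHA256, hashInit, hashUpdate, hashFinalize)
    import qualified Data.ByteString as B
    import System.IO
    import System.Posix.Files (createNamedPipe, ownerModes)

    -- Read what the wrapped remote writes into the fifo, and tee each
    -- chunk to the real destination file while hashing it.
    teeAndChecksum :: FilePath -> FilePath -> IO (Digest SHA256)
    teeAndChecksum fifo dest = do
        createNamedPipe fifo ownerModes
        -- (concurrently, the wrapped remote is told to TRANSFER into fifo)
        withFile fifo ReadMode $ \src ->
            withFile dest WriteMode $ \out ->
                let go ctx = do
                        chunk <- B.hGetSome src 65536
                        if B.null chunk
                            then return (hashFinalize ctx)
                            else B.hPut out chunk >> go (hashUpdate ctx chunk)
                in go hashInit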
"""]]

@@ -0,0 +1,34 @@
git-annex uses FilePath (String) extensively. That's a slow data type.
Converting to ByteString, and RawFilePath, should speed it up
significantly, according to [[/profiling]].
I've made a test branch, `bs`, to see what kind of performance improvement
to expect. Most commands don't build yet in that branch, but `git annex
find` does. Speedups range from 28-66%. The files fly by much more
snappily.
As well as adding back all the code that was disabled to get it to build,
the `bs` branch has quite a lot of things still needing work, including:
* Eliminate all the fromRawFilePath, toRawFilePath, encodeBS,
  decodeBS conversions. Or at least most of them. There are likely
  quite a few places where a value is converted back and forth several times.
  It would be good to instrument them with Debug.Trace to find out which
  are the hot ones that get called, and focus on those (see the
  instrumentation sketch after this list).
* System.FilePath is not available for RawFilePath, and many of the
  conversions are to get a FilePath in order to use that library.
  It should be entirely straightforward to make a version of System.FilePath
  that can operate on RawFilePath (see the sketch after this list),
  except possibly there could be some complications due to Windows.
* Use versions of IO actions like getFileStatus that take a RawFilePath,
  avoiding a conversion. Note that these are only available on unix, not
  windows, so a compatibility shim will be needed (sketched after this
  list; I can't seem to find any library that provides one).
* Eliminate some Data.ByteString.Lazy.toStrict, which is a slow copy.
* Use ByteString for parsing git config to speed up startup.
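For the conversion-elimination item, one possible way to instrument a
conversion is a `Debug.Trace.traceStack` wrapper. This is only a
debugging sketch, assuming fromRawFilePath is currently a thin wrapper
around git-annex's existing decodeBS helper:

    import Debug.Trace (traceStack)
    import System.Posix.ByteString.FilePath (RawFilePath)

    -- In a build compiled for profiling, traceStack also prints the
    -- cost-centre stack, showing which call sites convert most often.
    -- decodeBS is git-annex's existing ByteString -> String helper.
    fromRawFilePath :: RawFilePath -> FilePath
    fromRawFilePath p = traceStack "fromRawFilePath" (decodeBS p)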
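The System.FilePath point is mostly simple ByteString manipulation; for
example, a RawFilePath version of `takeFileName` might look like this
(a sketch assuming unix-style separators, ignoring the Windows
complications noted above; `takeFileName'` is a hypothetical name):

    import qualified Data.ByteString as B
    import System.Posix.ByteString.FilePath (RawFilePath)

    -- Everything after the last '/' (47 is the Word8 code for '/').
    takeFileName' :: RawFilePath -> RawFilePath
    takeFileName' = snd . B.breakEnd (== 47)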
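And the compatibility shim for IO actions could follow the usual CPP
pattern. Again a sketch: fromRawFilePath stands in for whatever decoding
helper ends up being used on the Windows side:

    {-# LANGUAGE CPP #-}
    #ifndef mingw32_HOST_OS
    -- The unix package provides RawFilePath versions directly.
    import System.Posix.Files.ByteString (getFileStatus)
    #else
    -- On Windows, fall back to converting and using unix-compat.
    import qualified System.PosixCompat.Files as P

    getFileStatus :: RawFilePath -> IO P.FileStatus
    getFileStatus = P.getFileStatus . fromRawFilePath
    #endif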

@@ -0,0 +1,44 @@
[[!comment format=mdwn
username="joey"
subject="""profiling"""
date="2019-11-26T20:05:28Z"
content="""
Profiling the early version of the `bs` branch.
    Tue Nov 26 16:05 2019 Time and Allocation Profiling Report (Final)

        git-annex +RTS -p -RTS find

    total time  = 2.75 secs (2749 ticks @ 1000 us, 1 processor)
    total alloc = 1,642,607,120 bytes (excludes profiling overheads)

    COST CENTRE                      MODULE                         SRC                                                %time %alloc
    inAnnex'.\                       Annex.Content                  Annex/Content.hs:(103,61)-(118,31)                  31.2   46.8
    keyFile'                         Annex.Locations                Annex/Locations.hs:(567,1)-(577,30)                  5.3    6.2
    encodeW8                         Utility.FileSystemEncoding     Utility/FileSystemEncoding.hs:(189,1)-(191,70)       3.3    4.2
    >>=.\                            Data.Attoparsec.Internal.Types Data/Attoparsec/Internal/Types.hs:(146,9)-(147,44)   2.9    0.8
    >>=.\.succ'                      Data.Attoparsec.Internal.Types Data/Attoparsec/Internal/Types.hs:146:13-76          2.6    0.3
    keyFile'.esc                     Annex.Locations                Annex/Locations.hs:(573,9)-(577,30)                  2.5    5.5
    parseLinkTarget                  Annex.Link                     Annex/Link.hs:(254,1)-(262,25)                       2.4    4.4
    getAnnexLinkTarget'.probesymlink Annex.Link                     Annex/Link.hs:78:9-46                                2.4    2.8
    w82s                             Utility.FileSystemEncoding     Utility/FileSystemEncoding.hs:217:1-15               2.3    6.0
    keyPath                          Annex.Locations                Annex/Locations.hs:(606,1)-(608,23)                  1.9    4.0
    parseKeyVariety                  Types.Key                      Types/Key.hs:(323,1)-(371,42)                        1.8    0.0
    getState                         Annex                          Annex.hs:(251,1)-(254,27)                            1.7    0.4
    fileKey'.go                      Annex.Locations                Annex/Locations.hs:588:9-55                          1.4    0.8
    fileKey'                         Annex.Locations                Annex/Locations.hs:(586,1)-(596,41)                  1.4    1.7
    hashUpdates.\.\.\                Crypto.Hash                    Crypto/Hash.hs:85:48-99                              1.3    0.0
    parseLinkTargetOrPointer         Annex.Link                     Annex/Link.hs:(239,1)-(243,25)                       1.2    0.1
    withPtr                          Basement.Block.Base            Basement/Block/Base.hs:(395,1)-(404,31)              1.2    0.6
    primitive                        Basement.Monad                 Basement/Monad.hs:72:5-18                            1.0    0.1
    decodeBS'                        Utility.FileSystemEncoding     Utility/FileSystemEncoding.hs:151:1-31               1.0    2.8
    mkKeySerialization               Types.Key                      Types/Key.hs:(115,1)-(117,22)                        0.7    1.1
    w82c                             Utility.FileSystemEncoding     Utility/FileSystemEncoding.hs:211:1-28               0.6    1.1
Comparing with the [[/profiling]] results, the alloc is down
significantly, and the main IO actions are getting a larger share of the
runtime. There is still significant conversion going on (encodeW8, w82s,
decodeBS', and w82c); likely another 5% or so speedup if that's
eliminated.
"""]]