followup
commit f2772f469a (parent 11d31f53bf)
2 changed files with 100 additions and 0 deletions

Annex/HashObject.hs (new file, 54 lines)
@@ -0,0 +1,54 @@
{- git hash-object interface, with handle automatically stored in the Annex monad
 -
 - Copyright 2016 Joey Hess <id@joeyh.name>
 -
 - Licensed under the GNU GPL version 3 or higher.
 -}

module Annex.HashObject (
    hashFile,
    hashBlob,
    hashObjectHandle,
    hashObjectStop,
) where

import qualified Data.ByteString.Lazy as L
import qualified Data.Map as M
import System.PosixCompat.Types

import Annex.Common
import qualified Git
import qualified Git.HashObject
import qualified Annex
import Git.Types
import Git.FilePath
import qualified Git.Ref
import Annex.Link
import Utility.Tmp (withTmpFile)

-- Get the hash-object handle, starting the long-running git hash-object
-- process the first time it is needed and caching it in the Annex state.
hashObjectHandle :: Annex Git.HashObject.HashObjectHandle
hashObjectHandle = maybe startup return =<< Annex.getState Annex.hashobjecthandle
  where
    startup = do
        h <- inRepo Git.HashObject.hashObjectStart
        Annex.changeState $ \s -> s { Annex.hashobjecthandle = Just h }
        return h

-- Stop the hash-object process, if one was started.
hashObjectStop :: Annex ()
hashObjectStop = maybe noop stop =<< Annex.getState Annex.hashobjecthandle
  where
    stop h = do
        liftIO $ Git.HashObject.hashObjectStop h
        Annex.changeState $ \s -> s { Annex.hashobjecthandle = Nothing }

hashFile :: FilePath -> Annex Sha
hashFile f = do
    h <- hashObjectHandle
    liftIO $ Git.HashObject.hashFile h f

{- Note that the content will be written to a temp file.
 - So it may be faster to use Git.HashObject.hashObject for large
 - blob contents. -}
hashBlob :: String -> Annex Sha
hashBlob content = withTmpFile "hashblob" $ \tmp tmph -> do
    liftIO $ do
        hPutStr tmph content
        hClose tmph
    hashFile tmp
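
Not part of the commit, but for illustration, a minimal sketch of how other Annex code might call this interface; `example` and the file path are made-up names, and the snippet assumes the same imports as the module above:

    example :: Annex ()
    example = do
        sha1 <- hashFile "foo/bar"        -- hash an existing file's content
        sha2 <- hashBlob "hello world\n"  -- hash an in-memory string
        liftIO $ print (sha1, sha2)       -- show the resulting shas
        hashObjectStop                    -- shut down the cached git process

The handle is started lazily on first use and reused across calls, so repeated hashing avoids spawning one git process per object.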

@@ -0,0 +1,46 @@
[[!comment format=mdwn
 username="joey"
 subject="""comment 1"""
 date="2016-03-14T17:59:08Z"
 content="""
If I've done the math right, at roughly 5 seconds per file, 3 hours comes to
only about 2000 files (3 hours is about 10800 seconds, and 10800 / 5 ≈ 2160).
The size of the files does matter, since git-annex has to read them all.
You said the repo grew to 28 GB; does that mean you added 2000 files
totalling 28 GB in size?

I can add 2000 tiny files (5 bytes each) in 2 seconds on a SSD on Linux.

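For reference, a rough sketch of that benchmark as a standalone program (this assumes git-annex is installed and that it is run inside a freshly initialized repository; the file names and the 5-byte "hello" content are illustrative):

    import Control.Monad (forM_)
    import Data.Time.Clock (diffUTCTime, getCurrentTime)
    import System.Process (callCommand)

    main :: IO ()
    main = do
        -- create 2000 tiny files in the current directory
        forM_ [1 :: Int .. 2000] $ \n ->
            writeFile ("file" ++ show n) "hello"
        -- time adding them all to the annex
        start <- getCurrentTime
        callCommand "git annex add ."
        end <- getCurrentTime
        putStrLn $ "git annex add took " ++ show (diffUTCTime end start)
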
By using a FAT filesystem, you've forced git-annex to use direct mode.
Direct mode can be a little slower, but not a great deal. Adding 2000 files
to a direct mode repo takes around 11 seconds here. (I did a little
optimisation and sped that up to 7 seconds.)

Doing the same benchmark on a removable USB stick with a FAT filesystem
was still not slow; 7 seconds again.

But then I had Linux mount that FAT filesystem sync (so it flushes each
file write to disk rather than buffering them), and I started getting closer
to your slow speed; the benchmark took 53 minutes.

So, I think the slow speed you're seeing is quite likely due to a
combination of, in order from most to least important:

1. Synchronous writes to your disk drive. Fixable on Linux by, e.g., running
   "mount -o remount,async /path/to/repo", and there's probably something
   similar for OSX.
2. The external drive being slow to access (and, if it's a spinning disk,
   slow to seek).
3. git-annex using direct mode on FAT.

Also, there is a fair amount of faff that git-annex does when adding a file,
calling rename, stat, mkdir, etc multiple times. It may be possible
to optimize some of that to get some speedup on synchronous disks.
But I'd not expect more than a few percentage points of speedup from such
optimisation.

One other possibility is that you could be hitting an edge case where direct
mode's performance is bad. One known such edge case is having a lot of files
that all have the same content. For example, I made 2000 files that were
all empty; adding them to a direct mode repository gets slower and slower,
to the point that it's spending 10 or more seconds per file.
"""]]