Joey Hess 2016-03-14 15:54:46 -04:00
parent 11d31f53bf
commit f2772f469a
2 changed files with 100 additions and 0 deletions

Annex/HashObject.hs (new file, 54 lines)

@@ -0,0 +1,54 @@
{- git hash-object interface, with handle automatically stored in the Annex monad
 -
 - Copyright 2016 Joey Hess <id@joeyh.name>
 -
 - Licensed under the GNU GPL version 3 or higher.
 -}

module Annex.HashObject (
    hashFile,
    hashBlob,
    hashObjectHandle,
    hashObjectStop,
) where

import qualified Data.ByteString.Lazy as L
import qualified Data.Map as M
import System.PosixCompat.Types

import Annex.Common
import qualified Git
import qualified Git.HashObject
import qualified Annex
import Git.Types
import Git.FilePath
import qualified Git.Ref
import Annex.Link

{- Gets the git hash-object handle, starting it on first use and caching
 - it in the Annex state so that later calls reuse the same process. -}
hashObjectHandle :: Annex Git.HashObject.HashObjectHandle
hashObjectHandle = maybe startup return =<< Annex.getState Annex.hashobjecthandle
  where
    startup = do
        h <- inRepo $ Git.HashObject.hashObjectStart
        Annex.changeState $ \s -> s { Annex.hashobjecthandle = Just h }
        return h

hashObjectStop :: Annex ()
hashObjectStop = maybe noop stop =<< Annex.getState Annex.hashobjecthandle
  where
    stop h = do
        liftIO $ Git.HashObject.hashObjectStop h
        Annex.changeState $ \s -> s { Annex.hashobjecthandle = Nothing }

hashFile :: FilePath -> Annex Sha
hashFile f = do
    h <- hashObjectHandle
    liftIO $ Git.HashObject.hashFile h f

{- Note that the content will be written to a temp file.
 - So it may be faster to use Git.HashObject.hashObject for large
 - blob contents. -}
hashBlob :: String -> Annex Sha
hashBlob content = do
    h <- hashObjectHandle
    liftIO $ Git.HashObject.hashBlob h content
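As a usage sketch (not part of this commit): a caller only needs hashFile or
hashBlob, and the handle management stays invisible. hashExample below is a
hypothetical function, and the assumption that Annex.Common and Git.Types
provide the needed types is inferred from the imports above.

-- Hypothetical caller, for illustration only; not part of this commit.
-- The first hashing call starts git hash-object; later calls reuse the
-- handle cached in the Annex state, avoiding a fork per file.
import Annex.Common
import Annex.HashObject
import Git.Types (Sha)

hashExample :: FilePath -> Annex (Sha, Sha)
hashExample f = do
    blobsha <- hashBlob "example pointer content"
    filesha <- hashFile f
    hashObjectStop  -- optional: stop the long-running process when done
    return (blobsha, filesha)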

@@ -0,0 +1,46 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2016-03-14T17:59:08Z"
content="""
If I've done the math right, 5 seconds per file over 3 hours is only about
2000 files (3 × 3600 = 10800 seconds; 10800 / 5 ≈ 2160).
The size of the files does matter, since git-annex has to read them all.
You said the repo grew to 28 GB; does that mean you added 2000 files
totalling 28 GB in size?
I can add 2000 tiny files (5 bytes each) in 2 seconds on an SSD on Linux.
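For concreteness, that benchmark amounts to something like the following
sketch (the file naming and exact git invocation here are assumptions, not
the script actually used):

    -- Create 2000 tiny (5 byte) files, then time "git annex add" on them.
    import Control.Monad (forM_)
    import Data.Time.Clock (diffUTCTime, getCurrentTime)
    import System.Process (readProcess)

    main :: IO ()
    main = do
        forM_ [1 .. 2000 :: Int] $ \n ->
            writeFile ("file" ++ show n) "hello"
        start <- getCurrentTime
        -- Assumes the current directory is inside a git-annex repository.
        _ <- readProcess "git" ["annex", "add", "."] ""
        end <- getCurrentTime
        print (diffUTCTime end start)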
By using a FAT filesystem, you've forced git-annex to use direct mode.
Direct mode can be a little slower, but not a great deal. Adding 2000 files
to a direct mode repo takes around 11 seconds here. (I did a little
optimisation and sped that up to 7 seconds.)
Doing the same benchmark on a removable USB stick with a FAT filesystem
was still not slow; 7 seconds again.
But then I had Linux mount that FAT filesystem sync (so it flushes each
file write to disk rather than buffering them), and I started getting closer
to your slow speed; the benchmark took 53 minutes.
So, I think the slow speed you're seeing is quite likely due to a
combination of, in order from most to least important:
1. Synchronous writes to your disk drive. Fixable on Linux by, e.g., running
"mount -o remount,async /path/to/repo"; there's probably something
similar for OS X.
2. External drive being slow to access. (And if a spinning disk, slow to
seek.)
3. git-annex using direct mode on FAT.
Also, there is a fair amount of faff that git-annex does when adding a file,
around calling rename, stat, mkdir, etc. multiple times. It may be possible
to optimize some of that to get some speedup on synchronous disks.
But I'd not expect more than a few percentage points of speedup from such
optimisation.
One other possibility is that you could be hitting an edge case where direct
mode's performance is bad. One known such edge case is having a lot of files
that all have the same content. For example, I made 2000 files that were
all empty; adding them to a direct mode repository gets slower and slower,
to the point that it's spending 10 or more seconds per file.
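That edge case is easy to reproduce with a variant of the sketch above,
giving every file identical content (empty files here; again an
illustration, not the exact script used):

    -- Create 2000 files that all share identical (empty) content, then
    -- "git annex add" them in a direct mode repo to observe the slowdown.
    import Control.Monad (forM_)

    main :: IO ()
    main = forM_ [1 .. 2000 :: Int] $ \n ->
        writeFile ("dup" ++ show n) ""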
"""]]