Joey Hess 2016-03-14 15:54:46 -04:00
parent 11d31f53bf
commit f2772f469a
2 changed files with 100 additions and 0 deletions

Annex/HashObject.hs (new file, 54 lines)

@@ -0,0 +1,54 @@
{- git hash-object interface, with handle automatically stored in the Annex monad
 -
 - Copyright 2016 Joey Hess <id@joeyh.name>
 -
 - Licensed under the GNU GPL version 3 or higher.
 -}

module Annex.HashObject (
    hashFile,
    hashBlob,
    hashObjectHandle,
    hashObjectStop,
) where

import qualified Data.ByteString.Lazy as L
import qualified Data.Map as M
import System.PosixCompat.Types

import Annex.Common
import qualified Git
import qualified Git.HashObject
import qualified Annex
import Git.Types
import Git.FilePath
import qualified Git.Ref
import Annex.Link

{- Gets the git hash-object handle, starting it on first use and caching
 - it in the Annex state so that later calls reuse the same process. -}
hashObjectHandle :: Annex Git.HashObject.HashObjectHandle
hashObjectHandle = maybe startup return =<< Annex.getState Annex.hashobjecthandle
  where
    startup = do
        h <- inRepo $ Git.HashObject.hashObjectStart
        Annex.changeState $ \s -> s { Annex.hashobjecthandle = Just h }
        return h

hashObjectStop :: Annex ()
hashObjectStop = maybe noop stop =<< Annex.getState Annex.hashobjecthandle
  where
    stop h = do
        liftIO $ Git.HashObject.hashObjectStop h
        Annex.changeState $ \s -> s { Annex.hashobjecthandle = Nothing }

hashFile :: FilePath -> Annex Sha
hashFile f = do
    h <- hashObjectHandle
    liftIO $ Git.HashObject.hashFile h f

{- Note that the content will be written to a temp file.
 - So it may be faster to use Git.HashObject.hashObject for large
 - blob contents. -}
hashBlob :: String -> Annex Sha
hashBlob content = do
    h <- hashObjectHandle
    liftIO $ Git.HashObject.hashBlob h content
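As a usage sketch (not part of this commit): a caller only needs hashFile or
hashBlob, and the handle management stays invisible. hashExample below is a
hypothetical function, and the assumption that Annex.Common and Git.Types
provide the needed types is inferred from the imports above.

-- Hypothetical caller, for illustration only; not part of this commit.
-- The first hashing call starts git hash-object; later calls reuse the
-- handle cached in the Annex state, avoiding a fork per file.
import Annex.Common
import Annex.HashObject
import Git.Types (Sha)

hashExample :: FilePath -> Annex (Sha, Sha)
hashExample f = do
    blobsha <- hashBlob "example pointer content"
    filesha <- hashFile f
    hashObjectStop  -- optional: stop the long-running process when done
    return (blobsha, filesha)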

@@ -0,0 +1,46 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2016-03-14T17:59:08Z"
content="""
If I've done the math right, 5 seconds per file over 3 hours is only about
2000 files (3 × 3600 = 10800 seconds; 10800 / 5 ≈ 2160).
The size of the files does matter, since git-annex has to read them all.
You said the repo grew to 28 GB; does that mean you added 2000 files
totalling 28 GB in size?
I can add 2000 tiny files (5 bytes each) in 2 seconds on an SSD on Linux.
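For concreteness, that benchmark amounts to something like the following
sketch (the file naming and exact git invocation here are assumptions, not
the script actually used):

    -- Create 2000 tiny (5 byte) files, then time "git annex add" on them.
    import Control.Monad (forM_)
    import Data.Time.Clock (diffUTCTime, getCurrentTime)
    import System.Process (readProcess)

    main :: IO ()
    main = do
        forM_ [1 .. 2000 :: Int] $ \n ->
            writeFile ("file" ++ show n) "hello"
        start <- getCurrentTime
        -- Assumes the current directory is inside a git-annex repository.
        _ <- readProcess "git" ["annex", "add", "."] ""
        end <- getCurrentTime
        print (diffUTCTime end start)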
By using a FAT filesystem, you've forced git-annex to use direct mode.
Direct mode can be a little slower, but not a great deal. Adding 2000 files
to a direct mode repo takes around 11 seconds here. (I did a little
optimisation and sped that up to 7 seconds.)
Doing the same benchmark on a removable USB stick with a FAT filesystem
was still not slow; 7 seconds again.
But then I had Linux mount that FAT filesystem sync (so it flushes each
file write to disk rather than buffering them), and I started getting closer
to your slow speed; the benchmark took 53 minutes.
So, I think the slow speed you're seeing is quite likely due to a
combination of, in order from most to least important:
1. Synchronous writes to your disk drive. Fixable on Linux by, e.g., running
"mount -o remount,async /path/to/repo"; there's probably something
similar for OS X.
2. External drive being slow to access. (And if a spinning disk, slow to
seek.)
3. git-annex using direct mode on FAT.
Also, there is a fair amount of faff that git-annex does when adding a file,
around calling rename, stat, mkdir, etc. multiple times. It may be possible
to optimize some of that to get some speedup on synchronous disks.
But I'd not expect more than a few percentage points of speedup from such
optimisation.
One other possibility is that you could be hitting an edge case where direct
mode's performance is bad. One known such edge case is having a lot of files
that all have the same content. For example, I made 2000 files that were
all empty; adding them to a direct mode repository gets slower and slower,
to the point that it's spending 10 or more seconds per file.
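That edge case is easy to reproduce with a variant of the sketch above,
giving every file identical content (empty files here; again an
illustration, not the exact script used):

    -- Create 2000 files that all share identical (empty) content, then
    -- "git annex add" them in a direct mode repo to observe the slowdown.
    import Control.Monad (forM_)

    main :: IO ()
    main = forM_ [1 .. 2000 :: Int] $ \n ->
        writeFile ("dup" ++ show n) ""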
"""]]