followup
parent 11d31f53bf
commit f2772f469a
2 changed files with 100 additions and 0 deletions
54
Annex/HashObject.hs
Normal file
@@ -0,0 +1,54 @@
{- git hash-object interface, with handle automatically stored in the Annex monad
 -
 - Copyright 2016 Joey Hess <id@joeyh.name>
 -
 - Licensed under the GNU GPL version 3 or higher.
 -}

module Annex.HashObject (
    hashFile,
    hashBlob,
    hashObjectHandle,
    hashObjectStop,
) where

import qualified Data.ByteString.Lazy as L
import qualified Data.Map as M
import System.PosixCompat.Types

import Annex.Common
import qualified Git
import qualified Git.HashObject
import qualified Annex
import Git.Types
import Git.FilePath
import qualified Git.Ref
import Annex.Link

hashObjectHandle :: Annex Git.HashObject.HashObjectHandle
hashObjectHandle = maybe startup return =<< Annex.getState Annex.hashobjecthandle
  where
    startup = do
        h <- inRepo $ Git.HashObject.hashObjectStart
        Annex.changeState $ \s -> s { Annex.hashobjecthandle = Just h }
        return h

hashObjectStop :: Annex ()
hashObjectStop = maybe noop stop =<< Annex.getState Annex.hashobjecthandle
  where
    stop h = do
        liftIO $ Git.HashObject.hashObjectStop h
        Annex.changeState $ \s -> s { Annex.hashobjecthandle = Nothing }

hashFile :: FilePath -> Annex Sha
hashFile f = do
    h <- hashObjectHandle
    liftIO $ Git.HashObject.hashFile h f

{- Note that the content will be written to a temp file.
 - So it may be faster to use Git.HashObject.hashObject for large
 - blob contents. -}
hashBlob :: String -> Annex Sha
hashBlob content = do
    h <- hashObjectHandle
    liftIO $ Git.HashObject.hashBlob h content
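The handle management in `Annex.HashObject` follows a start-on-first-use, cache-in-state pattern: the first caller starts the helper, later callers reuse it, and stopping clears the cache so the next caller starts a fresh one. Below is a minimal standalone sketch of that pattern only. Everything in it is an assumption made for illustration: `Env`, `Helper`, `getHelper` and `stopHelper` are hypothetical names, the state lives in an `IORef` from base rather than in the Annex monad, and no git process is started.

```haskell
import Data.IORef (IORef, modifyIORef', newIORef, readIORef, writeIORef)

-- | Hypothetical stand-in for a handle to a long-running helper process.
-- The Int records which start produced this handle.
newtype Helper = Helper Int deriving (Eq, Show)

-- | Hypothetical stand-in for the monad's state: the cached handle and
-- a count of how many helpers have been started so far.
data Env = Env
    { cachedHelper :: IORef (Maybe Helper)
    , startCount :: IORef Int
    }

newEnv :: IO Env
newEnv = Env <$> newIORef Nothing <*> newIORef 0

-- | Return the cached helper, starting one only when none is cached.
getHelper :: Env -> IO Helper
getHelper env = maybe startup return =<< readIORef (cachedHelper env)
  where
    startup = do
        modifyIORef' (startCount env) (+ 1)
        n <- readIORef (startCount env)
        let h = Helper n
        writeIORef (cachedHelper env) (Just h)
        return h

-- | Clear the cache, so the next getHelper starts a fresh helper.
-- Does nothing when no helper is cached.
stopHelper :: Env -> IO ()
stopHelper env = maybe (return ()) stop =<< readIORef (cachedHelper env)
  where
    stop _ = writeIORef (cachedHelper env) Nothing

main :: IO ()
main = do
    env <- newEnv
    a <- getHelper env -- starts the first helper
    b <- getHelper env -- reuses the cached helper
    stopHelper env
    c <- getHelper env -- starts a second helper
    print (a == b, c) -- prints (True,Helper 2)
```

The point of caching is that starting the helper is the expensive step, so it is paid at most once between stops.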
@@ -0,0 +1,46 @@
[[!comment format=mdwn
 username="joey"
 subject="""comment 1"""
 date="2016-03-14T17:59:08Z"
 content="""
If I've done the math right, 5 seconds per file over 3 hours is only about 2000 files.
The size of the files does matter, since git-annex has to read them all.
You said the repo grew to 28 GB; does that mean you added 2000 files
totalling 28 GB in size?
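
As a rough check of that arithmetic, here is a minimal Haskell sketch that converts between a per-file cost and a file count over a run of some hours. The function names `filesAdded` and `secondsPerFile` are made up for illustration and are not part of git-annex. A count near 2000 over 3 hours corresponds to roughly 5 seconds spent on each file.

```haskell
-- Hypothetical helper names; not part of git-annex.

-- | Number of whole files added in the given number of hours,
-- when each file costs perFile seconds.
filesAdded :: Int -> Int -> Int
filesAdded perFile hours = (hours * 3600) `div` perFile

-- | Average seconds spent per file when adding n files takes the
-- given number of hours.
secondsPerFile :: Int -> Int -> Double
secondsPerFile n hours = fromIntegral (hours * 3600) / fromIntegral n

main :: IO ()
main = do
    print (filesAdded 5 3)        -- prints 2160
    print (secondsPerFile 2000 3) -- prints 5.4
```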

I can add 2000 tiny files (5 bytes each) in 2 seconds on an SSD on Linux.
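
For anyone wanting to try a similar benchmark, a test set of tiny files can be generated along these lines. This is only a sketch under assumptions, not a record of how the numbers above were produced: the directory name `bench`, the file naming scheme, and the helper names `tinyContent` and `makeTinyFiles` are all hypothetical. Each file gets a distinct zero-padded number as its 5 bytes of content, so no two files are identical.

```haskell
import Control.Monad (forM_)
import System.Directory (createDirectoryIfMissing)
import System.FilePath ((</>))
import Text.Printf (printf)

-- | Five-character, zero-padded content for file number n, so each
-- file is 5 bytes long and no two files share content (for n < 100000).
tinyContent :: Int -> String
tinyContent = printf "%05d"

-- | Create count tiny files named file-00001, file-00002, ... in dir,
-- creating dir first if it does not exist.
makeTinyFiles :: FilePath -> Int -> IO ()
makeTinyFiles dir count = do
    createDirectoryIfMissing True dir
    forM_ [1 .. count] $ \n ->
        writeFile (dir </> ("file-" ++ tinyContent n)) (tinyContent n)

main :: IO ()
main = makeTinyFiles "bench" 2000
```

Running this inside a test repository and then timing `git annex add bench` gives a number to compare against the figures here.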

By using a FAT filesystem, you've forced git-annex to use direct mode.
Direct mode can be a little slower, but not a great deal. Adding 2000 files
to a direct mode repo takes around 11 seconds here. (I did a little
optimisation and sped that up to 7 seconds.)

Doing the same benchmark on a removable USB stick with a FAT filesystem
was still not slow; 7 seconds again.

But then I had Linux mount that FAT filesystem sync (so, it flushes each
file write to disk, not buffering them), and I started getting closer to your
slow speed; the benchmark took 53 minutes.

So, I think the slow speed you're seeing is quite likely due to a
combination of, in order from most to least important:

1. Synchronous writes to your disk drive. Fixable on Linux by, e.g., running
   "mount -o remount,async /path/to/repo" and there's probably something
   similar for OSX.
2. External drive being slow to access. (And if a spinning disk, slow to
   seek.)
3. git-annex using direct mode on FAT

Also there is a fair amount of faff that git-annex does when adding a file
around calling rename, stat, mkdir, etc multiple times. It may be possible
to optimize some of that to get some speedup on synchronous disks.
But, I'd not expect more than a few percent speedup from such
optimisation.

One other possibility is you could be hitting an edge case where direct mode's
performance is bad. One known such edge case is if you have a lot of files
that all have the same content. For example, I made 2000 files that were
all empty; adding them to a direct mode repository gets slower and slower
to the point it's spending 10 or more seconds per file.
"""]]