speed up keys database writes

There seems to be no reason to check the time here. I think it was
inherited from code in Database.Fsck, which does have a reason to commit
every few minutes. Removing that syscall speeds up a git-annex init
in a repo with 100000 annexed files by about 3 seconds.

Sponsored-by: Dartmouth College's Datalad project
Joey Hess 2021-05-31 14:56:14 -04:00
parent 0f54e5e0ae
commit eb6f6ff9b8
GPG key ID: DB12DB0FF05F8F38
4 changed files with 26 additions and 8 deletions


@@ -88,7 +88,9 @@ addDb :: FsckHandle -> Key -> IO ()
 addDb (FsckHandle h _) k = H.queueDb h checkcommit $
     void $ insertUnique $ Fscked k
   where
-    -- commit queue after 1000 files or 5 minutes, whichever comes first
+    -- Commit queue after 1000 changes or 5 minutes, whichever comes first.
+    -- The time based commit allows for an incremental fsck to be
+    -- interrupted and not lose much work.
     checkcommit sz lastcommittime
         | sz > 1000 = return True
         | otherwise = do
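The hunk above ends at the `otherwise` guard. For context, a minimal sketch of how the time-based branch presumably continues in Database.Fsck, mirroring the code this commit removes from the keys database below; the `Int`/`UTCTime` types are assumptions, as git-annex may use its own aliases:

```haskell
import Data.Time.Clock (UTCTime, diffUTCTime, getCurrentTime)

-- Sketch of the commit check kept for fsck: flush after 1000 changes,
-- or after 5 minutes, so an interrupted incremental fsck loses little work.
checkcommit :: Int -> UTCTime -> IO Bool
checkcommit sz lastcommittime
    | sz > 1000 = return True
    | otherwise = do
        now <- getCurrentTime
        return (diffUTCTime now lastcommittime > 300)
```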


@@ -27,7 +27,6 @@ import Git.FilePath
 import Database.Persist.Sql hiding (Key)
 import Database.Persist.TH
-import Data.Time.Clock
 import Control.Monad
 import Data.Maybe
@@ -77,12 +76,8 @@ newtype WriteHandle = WriteHandle H.DbQueue
 queueDb :: SqlPersistM () -> WriteHandle -> IO ()
 queueDb a (WriteHandle h) = H.queueDb h checkcommit a
   where
-    -- commit queue after 1000 changes or 5 minutes, whichever comes first
-    checkcommit sz lastcommittime
-        | sz > 1000 = return True
-        | otherwise = do
-            now <- getCurrentTime
-            return $ diffUTCTime now lastcommittime > 300
+    -- commit queue after 1000 changes
+    checkcommit sz _lastcommittime = pure (sz > 1000)
 
 addAssociatedFile :: Key -> TopFilePath -> WriteHandle -> IO ()
 addAssociatedFile k f = queueDb $ do
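Per the commit message, the old check ran getCurrentTime once per queued sql insert. A self-contained toy illustration (not git-annex code, all names hypothetical) of the size-only flush policy the new `checkcommit` implements:

```haskell
import Control.Monad (when)
import Data.IORef

-- A write queue that flushes purely on a counter, so queuing a change
-- never needs a clock syscall. Structure is illustrative only.
data Queue a = Queue
    { queuedItems :: IORef [a]
    , queuedCount :: IORef Int
    }

newQueue :: IO (Queue a)
newQueue = Queue <$> newIORef [] <*> newIORef 0

-- Queue one change; flush (oldest first) once more than 1000 are pending.
queueChange :: ([a] -> IO ()) -> Queue a -> a -> IO ()
queueChange flush q x = do
    modifyIORef' (queuedItems q) (x :)
    modifyIORef' (queuedCount q) succ
    sz <- readIORef (queuedCount q)
    when (sz > 1000) $ do
        readIORef (queuedItems q) >>= flush . reverse
        writeIORef (queuedItems q) []
        writeIORef (queuedCount q) 0
```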


@@ -4,3 +4,6 @@ E.g. following idea came to mind: git-annex could add some flag/beacon file (e.g
 [[!meta author=yoh]]
 [[!tag projects/datalad]]
+
+> I think I've improved this all that it can reasonably be sped up,
+> so [[done]]. --[[Joey]]


@@ -0,0 +1,18 @@
[[!comment format=mdwn
username="joey"
subject="""comment 13"""
date="2021-05-31T18:40:59Z"
content="""
There was an unnecessary check of the current time per sql insert; removing
that sped it up by 3 seconds in my benchmark.

Also tried increasing the number of inserts per sqlite transaction from 1k
to 10k. Memory use increased to 90 mb, but no measurable speed increase.

I don't see much else that can speed up the sqlite part, without going deep
into the weeds of populating sqlite databases without using sql, or using
multi-value inserts ([as described here](https://medium.com/@JasonWyatt/squeezing-performance-from-sqlite-insertions-971aff98eef2)).
Both would prevent using persistent to abstract sql away, and would
only be usable in this case, not speeding up git-annex generally,
so not too enthused.
"""]]