retry when sqlite throws ErrorIO

I suspect this may be due to SQLITE_IOERR_SHORT_READ, but have not
verified.

I was able to reproduce it on Linux after running the test suite in a loop
for 1-3 hours until it failed.

The WAL mode entry change in 3963c5fcf5
may have hidden the problem I was seeing; I have not seen an ErrorIO
since then.
This commit is contained in:
Joey Hess 2018-10-30 18:03:03 -04:00
parent 3963c5fcf5
commit 1428568554
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
3 changed files with 82 additions and 25 deletions

View file

@ -31,7 +31,6 @@ import Control.Exception (throwIO, BlockedIndefinitelyOnMVar(..))
import qualified Data.Text as T import qualified Data.Text as T
import Control.Monad.Trans.Resource (runResourceT) import Control.Monad.Trans.Resource (runResourceT)
import Control.Monad.Logger (runNoLoggingT) import Control.Monad.Logger (runNoLoggingT)
import Data.List
import System.IO import System.IO
{- A DbHandle is a reference to a worker thread that communicates with {- A DbHandle is a reference to a worker thread that communicates with
@ -175,36 +174,68 @@ workerThread db tablename jobs = go
liftIO (a (runSqliteRobustly tablename db)) liftIO (a (runSqliteRobustly tablename db))
loop loop
-- like runSqlite, but calls settle on the raw sql Connection. -- Like runSqlite, but more robust.
--
-- New database connections can sometimes take a while to become usable.
-- This may be due to WAL mode recovering after a crash, or perhaps a bug
-- like described in blob 500f777a6ab6c45ca5f9790e0a63575f8e3cb88f.
-- So, loop until a select succeeds; once one succeeds the connection will
-- stay usable.
--
-- And sqlite sometimes throws ErrorIO when there's not really an IO problem,
-- but perhaps just a short read(). That's caught and retried several times.
runSqliteRobustly :: TableName -> T.Text -> (SqlPersistM a) -> IO a runSqliteRobustly :: TableName -> T.Text -> (SqlPersistM a) -> IO a
runSqliteRobustly tablename db a = do runSqliteRobustly tablename db a = do
conn <- Sqlite.open db conn <- opensettle maxretries
settle conn go conn maxretries
runResourceT $ runNoLoggingT $
withSqlConn (wrapConnection conn) $
runSqlConn a
where where
-- Work around a bug in sqlite: New database connections can maxretries = 100 :: Int
-- sometimes take a while to become usable; select statements will
-- fail with ErrorBusy for some time. So, loop until a select rethrow msg e = throwIO $ userError $ show e ++ "(" ++ msg ++ ")"
-- succeeds; once one succeeds the connection will stay usable.
-- go conn retries = do
-- I reported this bug, but the sqlite developers did not respond. r <- try $ runResourceT $ runNoLoggingT $
-- Bug report is archived in blob 500f777a6ab6c45ca5f9790e0a63575f8e3cb88f withSqlConn (wrapConnection conn) $
-- in git-annex git repo. runSqlConn a
settle conn = do case r of
r <- tryNonAsync $ do Right v -> return v
Left ex@(Sqlite.SqliteException { Sqlite.seError = e })
| e == Sqlite.ErrorIO ->
let retries' = retries - 1
in if retries' < 1
then rethrow "after successful open" ex
else go conn retries'
| otherwise -> rethrow "after successful open" ex
opensettle retries = do
conn <- Sqlite.open db
settle conn retries
settle conn retries = do
r <- try $ do
stmt <- Sqlite.prepare conn nullselect stmt <- Sqlite.prepare conn nullselect
void $ Sqlite.step stmt void $ Sqlite.step stmt
void $ Sqlite.finalize stmt void $ Sqlite.finalize stmt
case r of case r of
Right _ -> return () Right _ -> return conn
Left e -> do Left ex@(Sqlite.SqliteException { Sqlite.seError = e })
if "ErrorBusy" `isInfixOf` show e | e == Sqlite.ErrorBusy -> do
then do -- Wait and retry any number of times; it
threadDelay 1000 -- 1/1000th second -- will stop being busy eventually.
settle conn briefdelay
else throwIO e settle conn retries
| e == Sqlite.ErrorIO -> do
-- Could be a real IO error,
-- so don't retry indefinitely.
Sqlite.close conn
briefdelay
let retries' = retries - 1
if retries' < 1
then rethrow "while opening database connection" ex
else opensettle retries'
| otherwise -> rethrow "while opening database connection" ex
-- This should succeed for any table. -- This should succeed for any table.
nullselect = T.pack $ "SELECT null from " ++ tablename ++ " limit 1" nullselect = T.pack $ "SELECT null from " ++ tablename ++ " limit 1"
briefdelay = threadDelay 1000 -- 1/1000th second

View file

@ -59,4 +59,4 @@ Oh yeah, I am still discovering this powerfull git annex tool.
In fact, collegues and I are forming a group during the process to exchange about different use cases, encountered problems and help each other. In fact, collegues and I are forming a group during the process to exchange about different use cases, encountered problems and help each other.
[[!meta title="sqlite crash on Windows running linux binary via WSL"]] [[!meta title="v7: intermittent sqlite ErrorIO crash (especially in WSL)"]]

View file

@ -0,0 +1,26 @@
[[!comment format=mdwn
username="joey"
subject="""comment 6"""
date="2018-10-30T16:23:21Z"
content="""
I saw a different "SQLite3 returned ErrorIO while attempting to perform step",
which seems likely to be closely related to this problem.
(The "step" action happens right after the "prepare" action.)
failed to commit changes to sqlite database: Just SQLite3 returned ErrorIO while attempting to perform step.
CallStack (from HasCallStack):
error, called at ./Database/Handle.hs:116:26 in main:Database.Handle
The test suite occasionally fails due to this, and it seems always in
`test_lock_v7_force`. Which notably deletes the sqlite database just before
the failure, and so causes it to be re-created.
Dumping the keys database after such a failure, it is freshly created,
contains the tables but no data has been written to it.
I've made git-annex catch ErrorIO and retry. Not sure if that fixes the bug,
but it may.
Please try, if you can, the new git-annex version 7, and see if this bug
still happens.
"""]]