More robust handling of ErrorBusy when writing to sqlite databases

Previously, ErrorBusy and other exceptions were caught and the write was
retried for up to 10 seconds, but it was still possible for git-annex to
eventually give up and error out without writing to the database. Now it
retries for as long as necessary.

This does mean that, if one git-annex process is suspended just as sqlite
has locked the database for writing, another git-annex that tries to write
to it might get stuck retrying forever. But that could already happen when
opening the sqlite database, which retries forever on ErrorBusy. This is an
area where git-annex is known to not behave well; there's a todo about the
general case of it.

Sponsored-by: Dartmouth College's Datalad project
Joey Hess 2022-10-17 15:56:19 -04:00
parent 0d762acf7e
commit 3149a1e2fe
4 changed files with 120 additions and 22 deletions


@@ -12,6 +12,7 @@ git-annex (10.20221004) UNRELEASED; urgency=medium
     by only asking for files under the prefix.
   * When importing from versioned remotes, fix tracking of the content
     of deleted files.
+  * More robust handling of ErrorBusy when writing to sqlite databases.
 
  -- Joey Hess <id@joeyh.name>  Mon, 03 Oct 2022 13:36:42 -0400


@@ -82,22 +82,20 @@ queryDb (DbHandle _ jobs) a = do
 
 {- Writes a change to the database.
  -
- - Writes can fail if another write is happening concurrently.
- - So write failures are caught and retried repeatedly for up to 10
- - seconds, which should avoid all but the most exceptional problems.
+ - Writes can fail when another write is happening concurrently.
+ - So write failures are caught and retried repeatedly.
  -}
 commitDb :: DbHandle -> SqlPersistM () -> IO ()
-commitDb h wa = robustly Nothing 100 (commitDb' h wa)
+commitDb h wa = robustly (commitDb' h wa)
   where
-    robustly :: Maybe SomeException -> Int -> IO (Either SomeException ()) -> IO ()
-    robustly e 0 _ = error $ "failed to commit changes to sqlite database: " ++ show e
-    robustly _ n a = do
+    robustly :: IO (Either SomeException ()) -> IO ()
+    robustly a = do
         r <- a
         case r of
             Right _ -> return ()
-            Left e -> do
+            Left _ -> do
                 threadDelay 100000 -- 1/10th second
-                robustly (Just e) (n-1) a
+                robustly a
 
 commitDb' :: DbHandle -> SqlPersistM () -> IO (Either SomeException ())
 commitDb' (DbHandle _ jobs) a = do
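
For reference, the retry pattern the new code uses can be pulled out into a
small standalone function. This is only an illustrative sketch (the
retryForever name and the use of try are not git-annex's API; the real
commitDb' already returns an Either, so it needs no try):

    import Control.Concurrent (threadDelay)
    import Control.Exception (SomeException, try)

    -- Run an action, retrying for as long as it keeps failing, with a
    -- short delay between attempts. There is no attempt limit, so this
    -- only returns once the action has succeeded, which is the behavior
    -- commitDb now has.
    retryForever :: IO a -> IO a
    retryForever a = either again return =<< tryAny a
      where
        tryAny :: IO x -> IO (Either SomeException x)
        tryAny = try
        again _ = threadDelay 100000 >> retryForever a

The tradeoff, as the commit message notes, is that if the process holding
the sqlite write lock is suspended, a caller of such a loop will simply
keep retrying until that process is resumed.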


@@ -0,0 +1,36 @@
[[!comment format=mdwn
username="joey"
subject="""comment 27"""
date="2022-10-17T18:49:47Z"
content="""
[[todo/withExclusiveLock_blocking_issue]] does not have to be solved for
every other lock in git-annex first. Since the sqlite database lock would
be a new lock file, it could use the mtime update method described in there
without backwards compatibility issues.

ErrorBusy can also occur when opening a new database connection for read,
but it retries that as often as necessary. Which does mean that suspending
git-annex at just the wrong time can already cause other git-annex
processes to stall forever waiting to read from the database.

So, in a way, it would be ok for write to also retry each time it gets
ErrorBusy, rather than the current limited number of retries. If that does
cause git-annex to block when another git-annex process is suspended, it
would not be a new behavior.

Also, the mtime file method described in
[[todo/withExclusiveLock_blocking_issue]] could be used without a lock file
in order to detect when a suspended process is causing ErrorBusy. And it
can avoid that situation for both writes and reads.

So, plan:

1. Retry forever on ErrorBusy when writing to the sqlite database.
   (I've made this change now... So I think probably this bug can't
   occur any longer.)
2. While running opensettle and ChangeJob, have a background thread that
   periodically updates an mtime file.
3. If ErrorBusy is received repeatedly for some amount of time,
   check the mtime file. If it's not being updated, give up, since
   a suspended git-annex process apparently has the sqlite database locked.
"""]]


@@ -1,20 +1,83 @@
-Some parts of git-annex use withExclusiveLock or otherwise wait for an
-exclusive lock and hold it while performing an operation. Now consider what
-happens if the git-annex process is suspended. Another git-annex process
-that is running and that blocks on the same lock will stall forever, until
-the git-annex process is resumed.
+Some parts of git-annex wait for an exclusive lock, and once they take it,
+hold it while performing an operation. Now consider what happens if the
+git-annex process is suspended. Another git-annex process that is running
+and that waits to take the same exclusive lock (or a shared lock of the
+same file) will stall forever, until the git-annex process is resumed.
 
 These time windows tend to be small, but may not always be.
 
-Would it be better for the second git-annex process, rather than hanging
-indefinitely, to try to take the lock a few times over a few seconds, and
-then error out? The risk with doing that is, when 2 concurrent git-annex
-processes are running and taking the locks repeatedly, one might get
-unlucky, fail to take the lock, and error out, when waiting a little longer
-would have succeeded, because the other process is not holding the lock all
-the time.
-
 Is there any better way git-annex could handle this? Is it a significant
 problem at all? I don't think I've ever seen it happen, but I rarely ^Z
 git-annex either. How do other programs handle this, if at all?
 
 --[[Joey]]
----

Would it be better for the second git-annex process, rather than hanging
indefinitely, to time out after a few seconds?

But how many seconds? What if the system is under heavy load?

> What could be done is, update the lock file's mtime after successfully
> taking the lock. Then, as long as the mtime is advancing, some other
> process is actively using it, and it's ok for our process to wait
> longer.
>
> (Updating the mtime would be a problem when locking annex object files
> in v9 and earlier. Luckily, that locking is not done with a blocking
> lock anyway.)
> If the lock file's mtime is being checked, the process that is
> blocking with the lock held could periodically update the mtime.
> A background thread could manage that. If that's done every ten seconds,
> then an mtime more than 20 seconds old indicates that the lock is
> held by a suspended process. So git-annex would stall for up to 20-30
> seconds before erroring out when a lock is held by a suspended process.
> That seems acceptable; it doesn't need to deal with this situation
> instantly, it just needs to not block indefinitely. And updating the
> mtime every 10 seconds should not be too much IO.
>
> When an old version of git-annex has the lock held, it won't be updating
> the mtime. So if it takes longer than 10 seconds to do the operation with
> the lock held, a new version may complain that it's suspended when it's
> really not. This could be avoided by checking what process holds the
> lock, and whether it's suspended. But probably 10 seconds is enough
> time for all the operations git-annex takes a blocking lock for
> currently to finish, and if so we don't need to worry about this situation?
>
> > Unfortunately not: importKeys takes an exclusive lock and holds it while
> > downloading all the content! This seems like a bug though, because it can
> > cause other git-annex processes that are eg storing content in a remote
> > to block for a long time.
> >
> > Another one is Database.Export.writeLockDbWhile, which takes an
> > exclusive lock while running eg, Command.Export.changeExport,
> > which may sometimes need to do a lot of work.
> >
> > Another one is Annex.Queue.flush, which probably mostly runs in under
> > 10 seconds, but maybe not always, and when annex.queuesize is changed,
> > could surely take longer.
>
> To avoid problems when old versions of git-annex are also in use, it could
> update and check the mtime of a different file than the lock file.
>
> Start by trying to take the lock for up to 10 seconds. If it takes the
> lock, create the mtime file and start a thread that updates the mtime
> every 10 seconds until the lock is closed, and delete the mtime file
> before closing the lock handle.
>
> When it times out taking the lock, if the mtime file does not exist, an
> old git-annex has the lock; if the mtime file does exist, then check
> if its timestamp has advanced; if not then a new git-annex has the lock
> and is suspended and it can error out.
>
> Oops: There's a race in the method above; a timeout may occur
> right when the other process has taken the lock, but has not updated
> the mtime file yet. Then that process would incorrectly be treated
> as an old git-annex process.
>
> So: To support old git-annex, it seems it will need to check, when the
> lock is held, what process has the lock. And then check if that process
> is suspended or not. Which means looking in /proc. Ugh.
>
> Or: Change to checking lock mtimes only in git-annex v11..
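
To make the scheme above concrete, here is a minimal sketch of the waiting
side, with the locking itself abstracted into a caller-supplied non-blocking
action so that it does not pretend to be any particular lock API. The names,
the once-per-second polling, and the 30-second check interval are all
assumptions for illustration; the holder side would be the heartbeat thread
sketched in the comment earlier, creating the mtime file after taking the
lock and deleting it before releasing the lock.

    import Control.Concurrent (threadDelay)
    import System.Directory (doesFileExist, getModificationTime)

    waitLock
        :: IO (Maybe lock)  -- non-blocking attempt to take the lock
        -> FilePath         -- the separate mtime (heartbeat) file
        -> IO (Either String lock)
    waitLock tryLock mtimefile = go (0 :: Int) Nothing
      where
        go n prev = do
            r <- tryLock
            case r of
                Just l -> return (Right l)
                Nothing
                    -- Every 30 polls (roughly 30 seconds), look at the
                    -- heartbeat file.
                    | n > 0 && n `mod` 30 == 0 -> do
                        cur <- getmtime
                        case (prev, cur) of
                            -- Seen twice, 30 seconds apart, with no
                            -- progress: the holder looks suspended.
                            (Just old, Just new) | new <= old ->
                                return (Left "suspended lock holder?")
                            _ -> continue n cur
                    | otherwise -> continue n prev
        continue n prev = do
            threadDelay 1000000 -- poll once per second
            go (n + 1) prev
        -- No heartbeat file means an old git-annex (or nothing) holds the
        -- lock, so there is nothing to compare; keep waiting.
        getmtime = do
            exists <- doesFileExist mtimefile
            if exists
                then Just <$> getModificationTime mtimefile
                else return Nothing

Requiring two samples taken 30 seconds apart, both of which saw the file and
saw no progress, also narrows the race noted above, where a timeout happens
just as another process has taken the lock but has not created or freshened
the mtime file yet.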