diff --git a/CHANGELOG b/CHANGELOG
index 6c045d7609..3f118af109 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -12,6 +12,7 @@ git-annex (10.20221004) UNRELEASED; urgency=medium
     by only asking for files under the prefix.
   * When importing from versioned remotes, fix tracking of the content
     of deleted files.
+  * More robust handling of ErrorBusy when writing to sqlite databases.
 
  -- Joey Hess <id@joeyh.name>  Mon, 03 Oct 2022 13:36:42 -0400
 
diff --git a/Database/Handle.hs b/Database/Handle.hs
index ce22d0a661..283eef4cf2 100644
--- a/Database/Handle.hs
+++ b/Database/Handle.hs
@@ -82,22 +82,20 @@ queryDb (DbHandle _ jobs) a = do
 
 {- Writes a change to the database.
  -
- - Writes can fail if another write is happening concurrently.
- - So write failures are caught and retried repeatedly for up to 10
- - seconds, which should avoid all but the most exceptional problems.
+ - Writes can fail when another write is happening concurrently.
+ - So write failures are caught and retried repeatedly.
  -}
 commitDb :: DbHandle -> SqlPersistM () -> IO ()
-commitDb h wa = robustly Nothing 100 (commitDb' h wa)
+commitDb h wa = robustly (commitDb' h wa)
   where
-    robustly :: Maybe SomeException -> Int -> IO (Either SomeException ()) -> IO ()
-    robustly e 0 _ = error $ "failed to commit changes to sqlite database: " ++ show e
-    robustly _ n a = do
+    robustly :: IO (Either SomeException ()) -> IO ()
+    robustly a = do
         r <- a
         case r of
             Right _ -> return ()
-            Left e -> do
+            Left _ -> do
                 threadDelay 100000 -- 1/10th second
-                robustly (Just e) (n-1) a
+                robustly a
 
 commitDb' :: DbHandle -> SqlPersistM () -> IO (Either SomeException ())
 commitDb' (DbHandle _ jobs) a = do
diff --git a/doc/bugs/get_is_busy_doing_nothing/comment_27_45168f110bded2f8c8f9777e1edda945._comment b/doc/bugs/get_is_busy_doing_nothing/comment_27_45168f110bded2f8c8f9777e1edda945._comment
new file mode 100644
index 0000000000..55307eb53d
--- /dev/null
+++ b/doc/bugs/get_is_busy_doing_nothing/comment_27_45168f110bded2f8c8f9777e1edda945._comment
@@ -0,0 +1,36 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 27"""
+ date="2022-10-17T18:49:47Z"
+ content="""
+[[todo/withExclusiveLock_blocking_issue]] does not have to be solved for
+every other lock in git-annex first. Since the sqlite database lock would
+be a new lock file, it could use the mtime update method described in there
+without backwards compatibility issues.
+
+ErrorBusy can also occur when opening a new database connection for read,
+but it retries that as often as necessary. Which does mean that suspending
+git-annex at just the wrong time can already cause other git-annex
+processes to stall forever waiting to read from the database.
+
+So, in a way, it would be ok for write to also retry each time it gets
+ErrorBusy, rather than the current limited number of retries. If that does
+cause git-annex to block when another git-annex process is suspended, it
+would not be a new behavior.
+
+Also, the mtime file method described in
+[[todo/withExclusiveLock_blocking_issue]] could be used without a lock file
+in order to detect when a suspended process is causing ErrorBusy. And can
+avoid that situation for both writes and reads.
+
+So, plan:
+
+1. Retry forever on ErrorBusy when writing to sqlite database.
+   (I've made this change now... So I think probably this bug can't
+   occur any longer.)
+2. While running opensettle and ChangeJob, have a background thread that
+   periodically updates an mtime file.
+3. If ErrorBusy is received repeatedly for some amount of time,
+   check the mtime file. If it's not being updated, give up, since
+   a suspended git-annex process apparently has the sqlite database locked.
+"""]]
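As a rough illustration of the plan above (not part of this patch), here is a minimal Haskell sketch of steps 2 and 3: a background thread that keeps a heartbeat mtime file fresh while the database is in use, and a retry loop that only gives up when that file exists but stops advancing. The module, function, and file names here are hypothetical, and the timings are only examples; this is not git-annex's actual code.

```haskell
module HeartbeatSketch where

import Control.Concurrent (forkIO, killThread, threadDelay)
import Control.Exception (SomeException, bracket, try)
import Control.Monad (forever)
import Data.Time.Clock (UTCTime, diffUTCTime, getCurrentTime)
import System.Directory (getModificationTime)

-- Step 2 of the plan: while the database is in use, touch the heartbeat
-- file every 10 seconds so other processes can tell we are not suspended.
withHeartbeat :: FilePath -> IO a -> IO a
withHeartbeat heartbeat a = bracket (forkIO beat) killThread (const a)
  where
    beat = forever $ do
        writeFile heartbeat ""   -- rewriting the file updates its mtime
        threadDelay 10000000     -- 10 seconds

-- Step 3 of the plan: retry a failing write indefinitely, but give up if
-- the heartbeat file exists and has gone stale, which suggests a suspended
-- process has the sqlite database locked.
retryUnlessStalled :: FilePath -> IO (Either SomeException ()) -> IO ()
retryUnlessStalled heartbeat a = go
  where
    go = do
        r <- a
        case r of
            Right () -> return ()
            Left e -> do
                stalled <- heartbeatStalled heartbeat 60
                if stalled
                    then error $ "sqlite database appears to be locked by a suspended process: " ++ show e
                    else threadDelay 100000 >> go   -- wait 1/10th second and retry

-- True when the heartbeat file exists but its mtime is older than the
-- given number of seconds.
heartbeatStalled :: FilePath -> Double -> IO Bool
heartbeatStalled heartbeat maxage = do
    r <- try (getModificationTime heartbeat) :: IO (Either SomeException UTCTime)
    case r of
        Left _ -> return False   -- no heartbeat file yet; keep retrying
        Right mtime -> do
            now <- getCurrentTime
            return (realToFrac (diffUTCTime now mtime) > maxage)
```

A caller would wrap its database work in withHeartbeat and funnel write attempts through retryUnlessStalled; if no heartbeat file exists at all, the sketch simply retries forever, matching step 1.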
diff --git a/doc/todo/withExclusiveLock_blocking_issue.mdwn b/doc/todo/withExclusiveLock_blocking_issue.mdwn
index 6915015da3..32c750ed7b 100644
--- a/doc/todo/withExclusiveLock_blocking_issue.mdwn
+++ b/doc/todo/withExclusiveLock_blocking_issue.mdwn
@@ -1,20 +1,83 @@
-Some parts of git-annex use withExclusiveLock or otherwise wait for an
-exclusive lock and hold it while performing an operation. Now consider what
-happens if the git-annex process is suspended. Another git-annex process
-that is running and that blocks on the same lock will stall forever, until
-the git-annex process is resumed.
+Some parts of git-annex wait for an exclusive lock, and once they take it,
+hold it while performing an operation. Now consider what happens if the
+git-annex process is suspended. Another git-annex process that is running
+and that waits to take the same exclusive lock (or a shared lock of the
+same file) will stall forever, until the git-annex process is resumed.
 
 These time windows tend to be small, but may not always be.
 
-Would it be better for the second git-annex process, rather than hanging
-indefinitely, to try to take the lock a few times over a few seconds, and
-then error out? The risk with doing that is, when 2 concurrent git-annex
-processes are running and taking the locks repeatedly, one might get
-unlucky, fail to take the lock, and error out, when waiting a little longer
-would have succeeded, because the other process is not holding the lock all
-the time.
-
 Is there any better way git-annex could handle this? Is it a significant
 problem at all? I don't think I've ever seen it happen, but I rarely
 ^Z git-annex either. How do other programs handle this, if at all?
 --[[Joey]]
+
+----
+
+Would it be better for the second git-annex process, rather than hanging
+indefinitely, to time out after a few seconds?
+
+But how many seconds? What if the system is under heavy load?
+
+> What could be done is, update the lock file's mtime after successfully
+> taking the lock. Then, as long as the mtime is advancing, some other
+> process is actively using it, and it's ok for our process to wait
+> longer.
+>
+> (Updating the mtime would be a problem when locking annex object files
+> in v9 and earlier. Luckily, that locking is not done with a blocking
+> lock anyway.)
+
+> If the lock file's mtime is being checked, the process that is
+> blocking with the lock held could periodically update the mtime.
+> A background thread could manage that. If that's done every ten seconds,
+> then an mtime more than 20 seconds old indicates that the lock is
+> held by a suspended process. So git-annex would stall for up to 20-30
+> seconds before erroring out when a lock is held by a suspended process.
+> That seems acceptable, it doesn't need to deal with this situation
+> instantly, it just needs to not block indefinitely. And updating the
+> mtime every 10 seconds should not be too much IO.
+>
+> When an old version of git-annex has the lock held, it won't be updating
+> the mtime. So if it takes longer than 10 seconds to do the operation with
+> the lock held, a new version may complain that it's suspended when it's
+> really not. This could be avoided by checking what process holds the
+> lock, and whether it's suspended. But probably 10 seconds is enough
+> time for all the operations git-annex takes a blocking lock for
+> currently to finish, and if so we don't need to worry about this situation?
+>
+> > Unfortunately not: importKeys takes an exclusive lock and holds it while
+> > downloading all the content! This seems like a bug though, because it can
+> > cause other git-annex processes that are eg storing content in a remote
+> > to block for a long time.
+> >
+> > Another one is Database.Export.writeLockDbWhile, which takes an
+> > exclusive lock while running eg, Command.Export.changeExport,
+> > which may sometimes need to do a lot of work.
+> >
+> > Another one is Annex.Queue.flush, which probably mostly runs in under
+> > 10 seconds, but maybe not always, and when annex.queuesize is changed,
+> > could surely take longer.
+>
+> To avoid problems when old git-annexes are also being used, it could
+> update and check the mtime of a different file than the lock file.
+>
+> Start by trying to take the lock for up to 10 seconds. If it takes the
+> lock, create the mtime file and start a thread that updates the mtime
+> every 10 seconds until the lock is closed, and delete the mtime file
+> before closing the lock handle.
+>
+> When it times out taking the lock, if the mtime file does not exist, an
+> old git-annex has the lock; if the mtime file does exist, then check
+> if its timestamp has advanced; if not then a new git-annex has the lock
+> and is suspended and it can error out.
+>
+> Oops: There's a race in the method above; a timeout may occur
+> right when the other process has taken the lock, but has not updated
+> the mtime file yet. Then that process would incorrectly be treated
+> as an old git-annex process.
+>
+> So: To support old git-annex, it seems it will need to check, when the
+> lock is held, what process has the lock. And then check if that process
+> is suspended or not. Which means looking in /proc. Ugh.
+>
+> Or: Change to checking lock mtimes only in git-annex v11..
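For concreteness, here is a similar sketch (again illustrative, not part of this patch) of the lock-taking side described above, assuming the filelock package's System.FileLock API: poll for the exclusive lock for about 10 seconds, and on timeout consult the separate mtime file to decide whether the holder is an old git-annex (no mtime file), alive (mtime advancing), or suspended (mtime stale). All names and timings are hypothetical, and it deliberately shares the race noted above: a new holder that has not yet created the mtime file is indistinguishable from an old git-annex.

```haskell
module LockTimeoutSketch where

import Control.Concurrent (threadDelay)
import Control.Exception (SomeException, try)
import Data.Time.Clock (UTCTime, diffUTCTime, getCurrentTime)
import System.Directory (getModificationTime)
import System.FileLock (FileLock, SharedExclusive(Exclusive), tryLockFile)

-- Poll for the exclusive lock for ~10 seconds (100 tries, 0.1s apart).
-- On timeout, look at the side-car mtime file to decide what to do.
takeLockOrGiveUp :: FilePath -> FilePath -> IO FileLock
takeLockOrGiveUp lockfile mtimefile = go (100 :: Int)
  where
    go 0 = do
        r <- try (getModificationTime mtimefile) :: IO (Either SomeException UTCTime)
        case r of
            -- No mtime file: an old git-annex (which maintains no such
            -- file) presumably holds the lock, so just keep waiting.
            Left _ -> go 1
            Right mtime -> do
                now <- getCurrentTime
                if realToFrac (diffUTCTime now mtime) > (20 :: Double)
                    -- mtime stale: the holder is probably suspended.
                    then error "lock appears to be held by a suspended process"
                    -- mtime advancing: the holder is alive; keep waiting.
                    else go 1
    go n = tryLockFile lockfile Exclusive
        >>= maybe (threadDelay 100000 >> go (n - 1)) return
```

The holder side would create the mtime file right after taking the lock, update it every 10 seconds from a background thread (as in the earlier sketch), and delete it before releasing the lock.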