git-annex/doc/todo/withExclusiveLock_blocking_issue.mdwn

Some parts of git-annex wait for an exclusive lock, and once they take it,
hold it while performing an operation. Now consider what happens if the
git-annex process is suspended while it holds the lock. Another git-annex
process that waits to take the same exclusive lock (or a shared lock of
the same file) will stall until the suspended process is resumed, which
may be never.

These time windows tend to be small, but may not always be.

Is there any better way git-annex could handle this? Is it a significant
problem at all? I don't think I've ever seen it happen, but I rarely ^Z
git-annex either. How do other programs handle this, if at all?
--[[Joey]]

----

Would it be better for the second git-annex process, rather than hanging
indefinitely, to time out after a few seconds?

But how many seconds? What if the system is under heavy load?

> What could be done is, update the lock file's mtime after successfully
> taking the lock. Then, as long as the mtime is advancing, some other
> process is actively using it, and it's ok for our process to wait
> longer.
> 
> (Updating the mtime would be a problem when locking annex object files
> in v9 and earlier. Luckily, that locking is not done with a blocking
> lock anyway.)

> If the lock file's mtime is being checked, the process that is
> blocking with the lock held could periodically update the mtime.
> A background thread could manage that. If that's done every ten seconds,
> then an mtime more than 20 seconds old indicates that the lock is
> held by a suspended process. So git-annex would stall for up to 20-30
> seconds before erroring out when a lock is held by a suspended process.
> That seems acceptable: it doesn't need to deal with this situation
> instantly, it just needs to not block indefinitely. And updating the
> mtime every 10 seconds should not be too much IO.
> 
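
The heartbeat idea above can be sketched in a few lines. This is an illustrative Python sketch, not git-annex code (git-annex itself is written in Haskell). The names `start_heartbeat` and `holder_looks_suspended` are made up for this example, and the 10 and 20 second figures are the ones from the paragraph above. It relies on the fact that suspending a process stops all of its threads, so the heartbeat stops too.

```python
import os
import threading
import time

HEARTBEAT_INTERVAL = 10  # seconds between mtime updates (figure from the text)
STALE_AFTER = 20         # an mtime older than this suggests a suspended holder


def start_heartbeat(path, interval=HEARTBEAT_INTERVAL):
    """Touch `path` now, then every `interval` seconds; return a stop function."""
    stop = threading.Event()
    os.utime(path, None)  # mark the lock as freshly taken

    def beat():
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval):
            os.utime(path, None)  # set atime and mtime to the current time

    thread = threading.Thread(target=beat, daemon=True)
    thread.start()

    def stop_heartbeat():
        stop.set()
        thread.join()

    return stop_heartbeat


def holder_looks_suspended(path, stale_after=STALE_AFTER, now=None):
    """True if the file's mtime has not advanced for `stale_after` seconds."""
    now = time.time() if now is None else now
    return now - os.stat(path).st_mtime > stale_after
```

The process holding the lock would call `start_heartbeat` right after taking it and call the returned function just before releasing it. A waiting process would call `holder_looks_suspended` each time its wait times out.
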
> When an old version of git-annex has the lock held, it won't be updating
> the mtime. So if it takes longer than 10 seconds to do the operation with
> the lock held, a new version may complain that it's suspended when it's
> really not. This could be avoided by checking what process holds the
> lock, and whether it's suspended. But 10 seconds is probably enough
> time for all the operations that git-annex currently takes a blocking
> lock for to finish, and if so we don't need to worry about this situation?
> 
> > Unfortunately not: importKeys takes an exclusive lock and holds it while
> > downloading all the content! This seems like a bug though, because it can
> > cause other git-annex processes that are eg storing content in a remote
> > to block for a long time.
> > 
> > Another one is Database.Export.writeLockDbWhile, which takes an
> > exclusive lock while running, eg, Command.Export.changeExport,
> > which may sometimes need to do a lot of work.
> >
> > Another one is Annex.Queue.flush, which probably mostly runs in under
> > 10 seconds, but maybe not always, and when annex.queuesize is changed,
> > could surely take longer.
> 
> To avoid problems when old versions of git-annex are also being used, it
> could update and check the mtime of a different file than the lock file.
> 
> Start by trying to take the lock for up to 10 seconds. If it takes the
> lock, create the mtime file and start a thread that updates the mtime 
> every 10 seconds until the lock is closed, and delete the mtime file
> before closing the lock handle. 
> 
> When it times out taking the lock, if the mtime file does not exist, an
> old git-annex has the lock; if the mtime file does exist, then check
> if its timestamp has advanced. If it has, keep waiting; if not, then a
> new git-annex has the lock and is suspended, and our process can error out.
> 
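
A rough Python sketch of the two steps above, to make the decision rules concrete. It is not git-annex code. It assumes `flock`-style locks for simplicity (git-annex's real locking primitive may differ), the helper names `try_lock_for`, `classify_holder` and `release` are hypothetical, and "timestamp has advanced" is approximated by the 20 second staleness threshold discussed earlier. The background thread that keeps the mtime file fresh is left out.

```python
import fcntl
import os
import time


def try_lock_for(lock_path, timeout, poll=0.1):
    """Poll for an exclusive flock; return the open fd, or None on timeout."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    deadline = time.monotonic() + timeout
    while True:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except BlockingIOError:
            # Someone else holds the lock; retry until the deadline passes.
            if time.monotonic() >= deadline:
                os.close(fd)
                return None
            time.sleep(poll)


def classify_holder(mtime_path, stale_after=20, now=None):
    """Decide what a timeout means, following the rules in the text."""
    now = time.time() if now is None else now
    try:
        age = now - os.stat(mtime_path).st_mtime
    except FileNotFoundError:
        return "old-version"  # no mtime file: holder predates this scheme
    return "suspended" if age > stale_after else "active"


def release(fd, mtime_path):
    """Delete the mtime file first, then drop the lock by closing the fd."""
    try:
        os.unlink(mtime_path)
    except FileNotFoundError:
        pass
    os.close(fd)
```

A caller would try `try_lock_for(lock_path, 10)`, and on `None` consult `classify_holder`: keep waiting on `"active"` or `"old-version"`, and error out on `"suspended"`.
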
> Oops: There's a race in the method above; a timeout may occur
> right when the other process has taken the lock, but has not updated
> the mtime file yet. Then that process would incorrectly be treated
> as an old git-annex process.
> 
> So: To support old git-annex, it seems it will need to check, when the
> lock is held, what process has the lock. And then check if that process
> is suspended or not. Which means looking in /proc. Ugh.
> 
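
For a sense of what the /proc check involves, here is a small Linux-only Python sketch (again illustrative, not git-annex code; the function names are invented). It assumes the pid of the lock holder is already known, and finding that pid is a separate problem not shown here. On Linux, the third field of `/proc/<pid>/stat` is a one-letter state, with `T` meaning stopped by a signal.

```python
def state_from_stat(stat_text):
    """Extract the one-letter process state from /proc/<pid>/stat contents."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split after the last ')'.
    after_comm = stat_text.rsplit(")", 1)[1]
    return after_comm.split()[0]


def is_suspended(pid):
    """True if `pid` is stopped ('T'), eg by ^Z or SIGSTOP. Linux only."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            return state_from_stat(f.read()) == "T"
    except FileNotFoundError:
        return False  # no such process (or no /proc), so nothing to report
```

This only covers Linux; other systems would each need their own way to ask whether a process is stopped, which is part of why this approach is unappealing.
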
> Or: Change to checking lock mtimes only in git-annex v11.