git-annex/doc/todo/withExclusiveLock_blocking_issue.mdwn

Some parts of git-annex wait for an exclusive lock, and once they take it,
hold it while performing an operation. Now consider what happens if the
git-annex process is suspended. Another git-annex process that is running
and waiting to take the same exclusive lock (or a shared lock on the
same file) will stall until the suspended process is resumed, which may
be never.

These time windows tend to be small, but may not always be.

Is there any better way git-annex could handle this? Is it a significant
problem at all? I don't think I've ever seen it happen, but I rarely ^Z
git-annex either. How do other programs handle this, if at all?
--[[Joey]]

----

Would it be better for the second git-annex process, rather than hanging
indefinitely, to time out after a few seconds?

But how many seconds? What if the system is under heavy load?

> What could be done is, update the lock file's mtime after successfully
> taking the lock. Then, as long as the mtime is advancing, some other
> process is actively using it, and it's ok for our process to wait
> longer.
>
> (Updating the mtime would be a problem when locking annex object files
> in v9 and earlier. Luckily, that locking is not done with a blocking
> lock anyway.)

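A rough sketch of the waiting side of that idea, in Python rather than git-annex's own Haskell. It assumes `flock`-style advisory locks, which may not match the kind of lock git-annex takes, and the names `wait_for_lock`, `LockStalled`, `window` and `poll` are all made up for illustration. The waiter keeps waiting for as long as the lock file's mtime changes between one window and the next, and gives up once a whole window passes with no change.

```python
import fcntl
import os
import time


class LockStalled(Exception):
    """Raised when the lock is held but its mtime has stopped advancing."""


def try_lock(fd):
    """Try to take an exclusive flock without blocking; True on success."""
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        return False


def wait_for_lock(path, window=10.0, poll=0.1):
    """Take an exclusive lock on path, waiting as long as its mtime advances.

    window and poll are hypothetical tuning knobs, in seconds.
    Returns the open file descriptor that holds the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        last_mtime = os.stat(path).st_mtime_ns
        deadline = time.monotonic() + window
        while not try_lock(fd):
            if time.monotonic() >= deadline:
                mtime = os.stat(path).st_mtime_ns
                if mtime == last_mtime:
                    # Nobody touched the lock file for a whole window.
                    raise LockStalled(path)
                # Some process took the lock during this window; wait on.
                last_mtime = mtime
                deadline = time.monotonic() + window
            time.sleep(poll)
        # We hold the lock now: advance the mtime so other waiters
        # can see that progress is being made.
        os.utime(path)
        return fd
    except BaseException:
        os.close(fd)
        raise
```

Closing the returned descriptor releases the lock. Note that this only detects progress when lock holders come and go; a single holder that legitimately needs longer than `window` would look stalled.
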
> If the lock file's mtime is being checked, the process that is
> blocking with the lock held could periodically update the mtime.
> A background thread could manage that. If that's done every ten seconds,
> then an mtime more than 20 seconds old indicates that the lock is
> held by a suspended process. So git-annex would stall for up to 20-30
> seconds before erroring out when a lock is held by a suspended process.
> That seems acceptable: it doesn't need to deal with this situation
> instantly, it just needs to not block indefinitely. And updating the
> mtime every 10 seconds should not be too much IO.
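
The background thread described above could look something like this Python sketch. The `Heartbeat` class and `holder_looks_suspended` function are hypothetical names, and the code leaves out taking the lock itself. It relies on the fact that suspending a process stops all of its threads, so the mtime stops advancing when the lock holder is suspended.

```python
import os
import threading
import time

HEARTBEAT_INTERVAL = 10.0  # seconds between mtime updates, as in the text
STALE_AFTER = 20.0         # an older mtime suggests a suspended holder


class Heartbeat:
    """Touch a file's mtime periodically from a daemon thread.

    Meant to be used by the process holding the lock, for as long
    as it holds it.
    """

    def __init__(self, path, interval=HEARTBEAT_INTERVAL):
        self.path = path
        self.interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait returns False on timeout, True once stop is set.
        while not self._stop.wait(self.interval):
            os.utime(self.path)

    def __enter__(self):
        os.utime(self.path)  # mark the moment the lock was taken
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        return False


def holder_looks_suspended(path, stale_after=STALE_AFTER):
    """True if the file's mtime is more than stale_after seconds old."""
    return time.time() - os.stat(path).st_mtime > stale_after
```

A holder would wrap its locked section in `with Heartbeat(lock_path): ...`, and a process that fails to get the lock would call `holder_looks_suspended(lock_path)` before deciding whether to keep waiting.
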
>
> When an old version of git-annex has the lock held, it won't be updating
> the mtime. So if it takes longer than 10 seconds to do the operation with
> the lock held, a new version may complain that it's suspended when it's
> really not. This could be avoided by checking what process holds the
> lock, and whether it's suspended. But 10 seconds is probably enough
> time for all the operations that git-annex currently takes a blocking
> lock for to finish, and if so we don't need to worry about this situation?
>
> > Unfortunately not: importKeys takes an exclusive lock and holds it while
> > downloading all the content! This seems like a bug though, because it can
> > cause other git-annex processes that are eg storing content in a remote
> > to block for a long time.
> >
> > Another one is Database.Export.writeLockDbWhile, which takes an
> > exclusive lock while running eg, Command.Export.changeExport,
> > which may sometimes need to do a lot of work.
> >
> > Another one is Annex.Queue.flush, which probably mostly runs in under
> > 10 seconds, but maybe not always, and when annex.queuesize is changed,
> > could surely take longer.
>
> To avoid problems when old versions of git-annex are also being used, it could
> update and check the mtime of a different file than the lock file.
>
> Start by trying to take the lock for up to 10 seconds. If it takes the
> lock, create the mtime file and start a thread that updates the mtime
> every 10 seconds until the lock is closed, and delete the mtime file
> before closing the lock handle.
>
> When it times out taking the lock, if the mtime file does not exist, an
> old git-annex has the lock; if the mtime file does exist, then check
> if its timestamp has advanced; if not, then a new git-annex has the lock
> and is suspended, and the waiting process can error out.
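
The decision made after a timeout can be sketched as a small Python function. `classify_holder`, the `Holder` enum and the `observe` parameter are hypothetical, and how long to watch the mtime file is an assumption; it would need to be longer than the update interval for "has not advanced" to mean anything.

```python
import os
import time
from enum import Enum


class Holder(Enum):
    OLD_VERSION = "old git-annex, no mtime file"
    ACTIVE = "new git-annex, mtime advancing"
    SUSPENDED = "new git-annex, mtime not advancing"


def classify_holder(mtime_file, observe=20.0):
    """Guess who holds the lock; call only after timing out on the lock.

    observe is how long, in seconds, to watch the mtime file for a change.
    """
    try:
        before = os.stat(mtime_file).st_mtime_ns
    except FileNotFoundError:
        # No mtime file: the holder does not know about this scheme.
        return Holder.OLD_VERSION
    time.sleep(observe)
    try:
        after = os.stat(mtime_file).st_mtime_ns
    except FileNotFoundError:
        # The holder deleted the file, so it finished while we watched.
        return Holder.ACTIVE
    return Holder.ACTIVE if after != before else Holder.SUSPENDED
```
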
>
> Oops: There's a race in the method above; a timeout may occur
> right when the other process has taken the lock, but has not updated
> the mtime file yet. Then that process would incorrectly be treated
> as an old git-annex process.
>
> So: To support old git-annex, it seems it will need to check, when the
> lock is held, what process has the lock. And then check if that process
> is suspended or not. Which means looking in /proc. Ugh.
>
> Or: Change to checking lock mtimes only in git-annex v11.
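
For the /proc route, the "is that process suspended" half is at least small. A Linux-only Python sketch, with hypothetical function names, that reads the state letter from `/proc/<pid>/stat`; finding out which pid holds the lock in the first place is a separate problem that this does not attempt.

```python
import os


def parse_state(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line.

    The command name in field 2 is wrapped in parentheses and may itself
    contain spaces or parentheses, so split after the last ')'.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def is_suspended(pid):
    """True if pid is stopped ('T') or in a tracing stop ('t').

    Returns False if the process has gone away or /proc is unavailable.
    """
    try:
        with open(f"/proc/{pid}/stat") as f:
            return parse_state(f.read()) in ("T", "t")
    except (FileNotFoundError, ProcessLookupError):
        return False
```

There is an unavoidable race here too: the holder can be suspended or resumed right after its state is read, so the answer is only a hint.
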