2022-10-17 19:56:19 +00:00

Some parts of git-annex wait for an exclusive lock, and once they take it,
hold it while performing an operation. Now consider what happens if the
git-annex process is suspended. Another git-annex process that is running
and that waits to take the same exclusive lock (or a shared lock of the
same file) will stall indefinitely, until the suspended git-annex process
is resumed.

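To make the stall concrete, here is a minimal Python sketch. It is not git-annex code: it assumes `flock`-style locks through Python's `fcntl` module, the file name `demo.lck` is made up, and a second open file in the same process stands in for the second git-annex process. The attempts are non-blocking so the sketch returns instead of hanging; a blocking attempt at the same point is what would stall.

```python
import fcntl
import os
import tempfile

def try_lock(path, exclusive=True):
    """Try to take a lock without blocking; return the open file, or None if it is held."""
    f = open(path, "a")
    mode = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    try:
        fcntl.flock(f, mode | fcntl.LOCK_NB)
        return f
    except BlockingIOError:
        f.close()
        return None

# Hypothetical lock file, created in a scratch directory.
lock_path = os.path.join(tempfile.mkdtemp(), "demo.lck")

holder = try_lock(lock_path)                        # stands in for the suspended process
blocked_ex = try_lock(lock_path)                    # another taker of the exclusive lock
blocked_sh = try_lock(lock_path, exclusive=False)   # a shared lock conflicts as well
holder.close()                                      # closing the file releases the lock
after_release = try_lock(lock_path)                 # only now can someone else take it
```
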
2022-10-12 19:53:56 +00:00

These time windows tend to be small, but may not always be.

Is there any better way git-annex could handle this? Is it a significant
problem at all? I don't think I've ever seen it happen, but I rarely ^Z
git-annex either. How do other programs handle this, if at all?
--[[Joey]]

----

Would it be better for the second git-annex process, rather than hanging
indefinitely, to time out after a few seconds?

But how many seconds? What if the system is under heavy load?

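A plain timeout could look something like the following Python sketch. It assumes `flock`-style locks via `fcntl`; the function name `lock_with_timeout` and its default values are invented for illustration. It polls with a non-blocking attempt until a deadline passes, and it shows the weakness raised above: the waiter has no way to tell a slow holder from a suspended one.

```python
import fcntl
import os
import tempfile
import time

def lock_with_timeout(path, timeout=5.0, poll=0.1):
    """Poll for an exclusive lock; return the open file, or None after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    f = open(path, "a")
    while True:
        try:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return f
        except BlockingIOError:
            if time.monotonic() >= deadline:
                f.close()
                return None
            time.sleep(poll)

# Demo with a hypothetical lock file; the holder lives in the same process.
path = os.path.join(tempfile.mkdtemp(), "demo.lck")
holder = lock_with_timeout(path)
start = time.monotonic()
second = lock_with_timeout(path, timeout=0.3, poll=0.05)   # gives up after ~0.3s
waited = time.monotonic() - start
holder.close()
third = lock_with_timeout(path, timeout=0.3)               # succeeds once released
```
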
> What could be done is, update the lock file's mtime after successfully
> taking the lock. Then, as long as the mtime is advancing, some other
> process is actively using it, and it's ok for our process to wait
> longer.
>
> (Updating the mtime would be a problem when locking annex object files
> in v9 and earlier. Luckily, that locking is not done with a blocking
> lock anyway.)

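The waiting side of that idea might be sketched like this in Python. Everything here is an assumption for illustration: `flock`-style locks via `fcntl`, the names `wait_for_lock` and `LockHolderStalled`, and a staleness threshold of 20 seconds. The waiter keeps polling as long as the lock file's mtime is recent and gives up once it has stopped advancing.

```python
import fcntl
import os
import tempfile
import time

class LockHolderStalled(Exception):
    """The lock is held, but the lock file's mtime has stopped advancing."""

def wait_for_lock(path, stale_after=20.0, poll=0.5):
    """Wait for an exclusive lock for as long as the holder looks active."""
    f = open(path, "a")   # append mode: creates the file, does not touch an existing mtime
    while True:
        try:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            age = time.time() - os.stat(path).st_mtime
            if age > stale_after:
                f.close()
                raise LockHolderStalled(f"{path}: mtime is {age:.0f}s old")
            time.sleep(poll)
        else:
            os.utime(path)   # update the mtime after successfully taking the lock
            return f

# Demo with a hypothetical lock file.
path = os.path.join(tempfile.mkdtemp(), "demo.lck")
holder = wait_for_lock(path)
old = time.time() - 60
os.utime(path, (old, old))   # simulate a holder whose last update was a minute ago
try:
    wait_for_lock(path, poll=0.05)
    gave_up = False
except LockHolderStalled:
    gave_up = True
holder.close()
```
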
> If the lock file's mtime is being checked, the process that is
> blocking with the lock held could periodically update the mtime.
> A background thread could manage that. If that's done every ten seconds,
> then an mtime more than 20 seconds old indicates that the lock is
> held by a suspended process. So git-annex would stall for up to 20-30
> seconds before erroring out when a lock is held by a suspended process.
> That seems acceptable; it doesn't need to deal with this situation
> instantly, it just needs to not block indefinitely. And updating the
> mtime every 10 seconds should not be too much IO.

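The holder side, with the background thread, could be sketched as below. This is illustrative Python, not how git-annex does or would do it; the context manager name `held_lock_with_heartbeat` is made up and `flock` via `fcntl` is assumed. Suspending a process stops all of its threads, so the heartbeat stops along with the operation, which is what lets a waiter notice.

```python
import contextlib
import fcntl
import os
import tempfile
import threading
import time

@contextlib.contextmanager
def held_lock_with_heartbeat(path, interval=10.0):
    """Hold an exclusive lock while a background thread refreshes the file's mtime."""
    f = open(path, "a")
    fcntl.flock(f, fcntl.LOCK_EX)   # blocking take, as in the scenario discussed
    os.utime(path)                  # mark the lock as freshly taken
    stop = threading.Event()

    def beat():
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval):
            os.utime(path)

    thread = threading.Thread(target=beat, daemon=True)
    thread.start()
    try:
        yield f
    finally:
        stop.set()
        thread.join()
        f.close()                   # closing the file releases the lock

# Demo with a short interval so it finishes quickly.
path = os.path.join(tempfile.mkdtemp(), "demo.lck")
with held_lock_with_heartbeat(path, interval=0.05):
    old = time.time() - 60
    os.utime(path, (old, old))      # pretend the last update was a minute ago
    time.sleep(0.5)                 # the long operation; the thread keeps beating
    age_while_held = time.time() - os.stat(path).st_mtime
```
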
>
> When an old version of git-annex has the lock held, it won't be updating
> the mtime. So if it takes longer than 10 seconds to do the operation with
> the lock held, a new version may complain that it's suspended when it's
> really not. This could be avoided by checking what process holds the
> lock, and whether it's suspended. But probably 10 seconds is enough
> time for all the operations git-annex takes a blocking lock for
> currently to finish, and if so we don't need to worry about this situation?
>
> > Unfortunately not: importKeys takes an exclusive lock and holds it while
> > downloading all the content! This seems like a bug though, because it can
> > cause other git-annex processes that are eg storing content in a remote
> > to block for a long time.
> >
> > Another one is Database.Export.writeLockDbWhile, which takes an
> > exclusive lock while running eg, Command.Export.changeExport,
> > which may sometimes need to do a lot of work.
> >
> > Another one is Annex.Queue.flush, which probably mostly runs in under
> > 10 seconds, but maybe not always, and when annex.queuesize is changed,
> > could surely take longer.
>
> To avoid problems when old versions of git-annex are also being used, it
> could update and check the mtime of a different file than the lock file.
>
> Start by trying to take the lock for up to 10 seconds. If it takes the
> lock, create the mtime file and start a thread that updates the mtime
> every 10 seconds until the lock is closed, and delete the mtime file
> before closing the lock handle.
>
> When it times out taking the lock, if the mtime file does not exist, an
> old git-annex has the lock; if the mtime file does exist, then check
> if its timestamp has advanced; if not then a new git-annex has the lock
> and is suspended and it can error out.
>
> Oops: There's a race in the method above; a timeout may occur
> right when the other process has taken the lock, but has not updated
> the mtime file yet. Then that process would incorrectly be treated
> as an old git-annex process.

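The decision made after a timeout, using the separate mtime file, can be written down as a small Python sketch. The names `Holder` and `classify_holder` and the file name `demo.lck.mtime` are hypothetical, and "has not advanced" is approximated by the 20 second staleness rule from earlier. The sketch does nothing about the race just described: a missing file is reported as an old version even when a new process has only just taken the lock and not yet created the file.

```python
import os
import tempfile
import time
from enum import Enum

class Holder(Enum):
    OLD_VERSION = "old git-annex (or the racy just-locked case)"
    ACTIVE = "new git-annex, still updating the mtime file"
    SUSPENDED = "new git-annex, mtime file has gone stale"

def classify_holder(mtime_file, stale_after=20.0, now=None):
    """Classify whoever holds the lock, to be called after timing out on it."""
    now = time.time() if now is None else now
    try:
        mtime = os.stat(mtime_file).st_mtime
    except FileNotFoundError:
        return Holder.OLD_VERSION
    return Holder.SUSPENDED if now - mtime > stale_after else Holder.ACTIVE

# Demo with a hypothetical mtime file next to a lock file.
mtime_file = os.path.join(tempfile.mkdtemp(), "demo.lck.mtime")
when_missing = classify_holder(mtime_file)
open(mtime_file, "w").close()          # holder created the file just now
when_fresh = classify_holder(mtime_file)
old = time.time() - 60
os.utime(mtime_file, (old, old))       # no update for a minute
when_stale = classify_holder(mtime_file)
```

Only the `SUSPENDED` result would justify erroring out; the other two mean waiting longer.
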
>
> So: To support old git-annex, it seems it will need to check, when the
> lock is held, what process has the lock. And then check if that process
> is suspended or not. Which means looking in /proc. Ugh.
>
> Or: Change to checking lock mtimes only in git-annex v11.
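For the /proc route mentioned above, checking whether a known pid is suspended might look like this on Linux. This is a Python sketch under assumptions: the function names are made up, it relies on the Linux `/proc/<pid>/stat` layout where the state letter follows the parenthesised command name, and it only treats state `T` (stopped on a signal) as suspended. Finding out which pid holds the lock is a separate problem that is not shown here.

```python
def state_from_stat(stat_text):
    """Extract the state letter from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the last ')'.
    """
    return stat_text.rsplit(")", 1)[1].split()[0]

def is_suspended(pid):
    """True if the process is in state 'T', i.e. stopped by a signal such as ^Z."""
    with open(f"/proc/{pid}/stat") as f:
        return state_from_stat(f.read()) == "T"

# Made-up, shortened sample lines in the /proc/<pid>/stat layout.
sample_stopped = "4242 (git-annex) T 4000 4242 4000"
sample_sleeping = "4243 (git annex (sync)) S 4000 4243 4000"
stopped_state = state_from_stat(sample_stopped)
sleeping_state = state_from_stat(sample_sleeping)
```

This only works where /proc exists in this form, which is part of why it is unappealing.
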