Seem there are several races that happen when 2 threads run PidLock.tryLock
at the same time. One involves checkSaneLock of the side lock file, which may
be deleted by another process that is dropping the lock, causing checkSaneLock
to fail. And even with the deletion disabled, it can still fail, Probably due
to linkToLock failing when a second thread overwrites the lock file.
The same can happen when 2 processes do, but then one process just fails
to take the lock, which is fine. But with 2 threads, some actions where failing
even though the process as a whole had the pid lock held.
Utility.LockPool.PidLock already maintains a STM lock, and since it uses
LockShared, 2 threads can hold the pidlock at the same time, and when
the first thread drops the lock, it will remain held by the second
thread, and so the pid lock file should not get deleted until the last
thread to hold it drops the lock. Which is the right behavior, and why a
LockShared STM lock is used in the first place.
The problem is that each time it takes the STM lock, it then also calls
PidLock.tryLock. So that was getting called repeatedly and concurrently.
Fixed by noticing when the shared lock is already held, and stop calling
PidLock.tryLock again, just use the pid lock that already exists then.
Also, LockFile.PidLock.tryLock was deleting the pid lock when it failed
to take the lock, which was entirely wrong. It should only drop the side
lock.
Sponsored-by: Dartmouth College's Datalad project
This fixes a reversion introduced in commit
ac56a5c2a0.
I didn't notice there that it was handling the case of a shared lock
file that was still open elsewhere by not running the close action.
This was especially deadly when annex.pidlock is set, as it caused early
deletion of the pid lock file.
Sponsored-by: Dartmouth College's Datalad project
nukeFile replaced with removeWhenExistsWith removeLink, which allows
using RawFilePath. Utility.Directory cannot use RawFilePath since setup
does not depend on posix.
This commit was sponsored by Graham Spencer on Patreon.
orElse is great, but was not the right thing to use here because
waitTakeLock could retry for other reasons than the lock being held,
which made tryTakeLock fail when it shouldn't.
Instead, move the code to tryTakeLock and implement waitTakeLock using
tryTakeLock and retry.
(Also, in runTransfer, when checkSaneLock fails, dropLock to avoid leaking a
lock handle.)
This commit was supported by the NSF-funded DataLad project.
This fixes behavior in this situation:
l1 <- lockShared Nothing "lck"
l2 <- lockShared Nothing "lck"
dropLock l1
dropLock l2
Before, the lock was dropped upon the second dropLock call, but the fd
remained open, and would never be closed while the program was running.
Fixed by a rather round-about method, but it should work well enough.
It would have been simpler to open open the shared lock once, and not open
it again in the second call to lockShared. But, that's difficult to do
atomically.
This also affects Windows and PID locks, not just posix locks.
In the case of pid locks, multiple calls to waitLock within the same
process are allowed because the side lock is locked using a posix lock,
and so multiple exclusive locks can be taken in the same process. So,
this change fixes a similar problem with pid locks.
l1 <- waitLock (Seconds 1) "lck"
l2 <- waitLock (Seconds 1) "lck"
dropLock l1
dropLock l2
Here the l2 side lock fd remained open but not locked,
although the pid lock file was removed. After this change, the second
dropLock will close both fds to the side lock, and delete the pidlock.
Also, rename lockContent to lockContentExclusive
inAnnexSafe should perhaps be eliminated, and instead use
`lockContentShared inAnnex`. However, I'm waiting on that, as there are
only 2 call sites for inAnnexSafe and it's fiddly.
The one exception is in Utility.Daemon. As long as a process only
daemonizes once, which seems reasonable, and as long as it avoids calling
checkDaemon once it's already running as a daemon, the fcntl locking
gotchas won't be a problem there.
Annex.LockFile has it's own separate lock pool layer, which has been
renamed to LockCache. This is a persistent cache of locks that persist
until closed.
This is not quite done; lockContent stil needs to be converted.