catch error statting pid lock file if it somehow does not exist

It ought to exist, since linkToLock has just created it. However,
Lustre seems to have a rather probabilistic view of the contents of a
directory, so catching the error if it somehow does not exist and
running the same code path that would be run if linkToLock failed
might avoid this fun Lustre failure.

Sponsored-by: Dartmouth College's Datalad project
Joey Hess 2021-11-29 14:51:28 -04:00
parent 567f63ba47
commit a6699be79d
GPG key ID: DB12DB0FF05F8F38
2 changed files with 32 additions and 13 deletions

@@ -154,14 +154,11 @@ tryLock lockfile = do
 		removeWhenExistsWith removeLink tmp'
 		return Nothing
 	let tooklock st = return $ Just $ LockHandle abslockfile st sidelock
-	ifM (linkToLock sidelock tmp' abslockfile)
-		( do
+	linkToLock sidelock tmp' abslockfile >>= \case
+		Just lckst -> do
 			removeWhenExistsWith removeLink tmp'
-			-- May not have made a hard link, so stat
-			-- the lockfile
-			lckst <- getFileStatus abslockfile
 			tooklock lckst
-		, do
+		Nothing -> do
 			v <- readPidLock abslockfile
 			hn <- getHostName
 			tmpst <- getFileStatus tmp'
@@ -175,7 +172,6 @@ tryLock lockfile = do
 					rename tmp' abslockfile
 					tooklock tmpst
 				_ -> failedlock tmpst
-		)
 
 -- Linux's open(2) man page recommends linking a pid lock into place,
 -- as the most portable atomic operation that will fail if
@@ -187,8 +183,8 @@ tryLock lockfile = do
 --
 -- However, not all filesystems support hard links. So, first probe
 -- to see if they are supported. If not, use open with O_EXCL.
-linkToLock :: SideLockHandle -> RawFilePath -> RawFilePath -> IO Bool
-linkToLock Nothing _ _ = return False
+linkToLock :: SideLockHandle -> RawFilePath -> RawFilePath -> IO (Maybe FileStatus)
+linkToLock Nothing _ _ = return Nothing
 linkToLock (Just _) src dest = do
 	let probe = src <> ".lnk"
 	v <- tryIO $ createLink src probe
@@ -197,10 +193,13 @@ linkToLock (Just _) src dest = do
 		Right _ -> do
 			_ <- tryIO $ createLink src dest
 			ifM (catchBoolIO checklinked)
-				( catchBoolIO $ not <$> checkInsaneLustre dest
-				, return False
+				( ifM (catchBoolIO $ not <$> checkInsaneLustre dest)
+					( catchMaybeIO $ getFileStatus dest
+					, return Nothing
+					)
+				, return Nothing
 				)
-		Left _ -> catchBoolIO $ do
+		Left _ -> catchMaybeIO $ do
 			let setup = do
 				fd <- openFd dest WriteOnly
 					(Just $ combineModes readModes)
@@ -209,7 +208,7 @@ linkToLock (Just _) src dest = do
 			let cleanup = hClose
 			let go h = readFile (fromRawFilePath src) >>= hPutStr h
 			bracket setup cleanup go
-			return True
+			getFileStatus dest
   where
 	checklinked = do
 		x <- getSymbolicLinkStatus src

@@ -0,0 +1,20 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 1"""
+ date="2021-11-29T18:11:30Z"
+ content="""
+I think that this error comes from Utility.LockFile.PidLock.tryLock,
+which has the only getFileStatus involving the pidlock whose exceptions
+are not caught. The file is assumed to exist since it was just created,
+and normally nothing deletes it.
+
+While looking at where this might come from, I refreshed my memory of
+how Lustre can do insane stuff like having 2 different files with the
+same name in a directory. Which checkInsaneLustre tries to deal with
+by deleting one of them, but since this is all behavior undefined by POSIX,
+maybe that sometimes deletes both of them. Or the file doesn't appear
+after being created for some other POSIX-defying reason.
+
+I've changed it to catch exceptions from that getFileStatus, which will
+test this theory.
+"""]]
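
The idea in the commit message and comment above is to treat a failed stat of the just-created lock file the same way as a failure to take the lock. It can be sketched in isolation roughly as follows. This is a minimal standalone illustration, not the git-annex code: `tryIOException`, `statMaybe`, and the example path are hypothetical names chosen here, and it assumes only the `base` and `unix` packages.

```haskell
import Control.Exception (IOException, try)
import System.Posix.Files (FileStatus, fileSize, getFileStatus)

-- Hypothetical helper: 'try' specialised to IOException, so callers
-- do not need a type annotation at each use site.
tryIOException :: IO a -> IO (Either IOException a)
tryIOException = try

-- Hypothetical helper: stat a path, returning Nothing instead of
-- throwing when the file is missing or otherwise cannot be statted.
statMaybe :: FilePath -> IO (Maybe FileStatus)
statMaybe path =
    either (const Nothing) Just <$> tryIOException (getFileStatus path)

main :: IO ()
main = do
    -- Example path only; it is assumed not to exist on the test machine.
    r <- statMaybe "/nonexistent/pidlock"
    putStrLn $ case r of
        Just st -> "stat ok, size " ++ show (fileSize st)
        Nothing -> "stat failed; take the same path as a failed lock attempt"
```

Returning a `Maybe FileStatus` rather than a `Bool` lets the caller match on a single result, so a file that has vanished and a lock that was never taken end up in the same branch.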