much flailing

This commit is contained in:
Joey Hess 2019-11-13 15:23:10 -04:00
parent 58be3af084
commit 0096db7b42
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
3 changed files with 120 additions and 0 deletions

View file

@ -0,0 +1,20 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2019-11-13T16:34:34Z"
content="""
Reproduced.
After building git-annex with the DebugLocks flag, I got this:
debugLocks, called at ./Annex/Transfer.hs:248:18 in main:Annex.Transfer
debugLocks, called at ./CmdLine/Action.hs:263:26 in main:CmdLine.Action
Which points to pickRemote and ensureOnlyActionOn. But pickRemote
does no STM actions when there's only 1 remote, so it must really be
the latter.
Also, I notice that when 5 files to get are provided, it crashes, but with
less than 5, it succeeds.
Even this trivial case crashes: `git annex get -J1 1 2`
"""]]

View file

@ -0,0 +1,83 @@
[[!comment format=mdwn
username="joey"
subject="""comment 2"""
date="2019-11-13T17:07:29Z"
content="""
Ok, I see the bug. ensureOnlyActionOn does a STM
retry if it finds in the activekeys map some other thread
is operating on the same key.
But, there is no running STM transaction what will update
the map. So, STM detects that the retry would deadlock.
It's not really a deadlock, because once the other thread finishes,
it will update the map to remove itself. But STM can't know that.
The solution will be to not use STM for waiting on the other thread.
Hmm, I tried the obvious approach, using a MVar semaphore to wait for the
thread, but that just resulted in more STM and MVar deadlocks.
I don't understand why after puzzling over it for two hours. I did
instrument all calls to atomically, and it looks, unfortunately, like
the one in finishCommandActions is deadlocking. If the problem extends
beyond ensureOnlyActionOn it may be much more complicated.
Patch that does not work and I don't know why.
diff --git a/CmdLine/Action.hs b/CmdLine/Action.hs
index 87298a95f..bf4bdd589 100644
--- a/CmdLine/Action.hs
+++ b/CmdLine/Action.hs
@@ -268,16 +268,30 @@ ensureOnlyActionOn k a = debugLocks $
go ConcurrentPerCpu = goconcurrent
goconcurrent = do
tv <- Annex.getState Annex.activekeys
- bracket (setup tv) id (const a)
- setup tv = liftIO $ do
+ bracketIO (setup tv) id (const a)
+ setup tv = do
+ mysem <- newEmptyMVar
mytid <- myThreadId
- atomically $ do
+ finishsetup <- atomically $ do
m <- readTVar tv
case M.lookup k m of
- Just tid
- | tid /= mytid -> retry
- | otherwise -> return $ return ()
+ Just (tid, theirsem)
+ | tid /= mytid -> return $ do
+ -- wait for the other
+ -- thread to finish, and
+ -- retry (STM retry would
+ -- deadlock)
+ readMVar theirsem
+ setup tv
+ | otherwise -> return $
+ -- same thread, so no
+ -- blocking
+ return $ return ()
Nothing -> do
- writeTVar tv $! M.insert k mytid m
- return $ liftIO $ atomically $
- modifyTVar tv $ M.delete k
+ writeTVar tv $! M.insert k (mytid, mysem) m
+ return $ return $ do
+ atomically $ modifyTVar tv $
+ M.delete k
+ -- indicate finished
+ putMVar mysem ()
+ finishsetup
diff --git a/Annex.hs b/Annex.hs
index 9eb4c5f39..936399ae7 100644
--- a/Annex.hs
+++ b/Annex.hs
@@ -143,7 +143,7 @@ data AnnexState = AnnexState
, existinghooks :: M.Map Git.Hook.Hook Bool
, desktopnotify :: DesktopNotify
, workers :: Maybe (TMVar (WorkerPool AnnexState))
- , activekeys :: TVar (M.Map Key ThreadId)
+ , activekeys :: TVar (M.Map Key (ThreadId, MVar ()))
, activeremotes :: MVar (M.Map (Types.Remote.RemoteA Annex) Integer)
, keysdbhandle :: Maybe Keys.DbHandle
, cachedcurrentbranch :: (Maybe (Maybe Git.Branch, Maybe Adjustment))
"""]]

View file

@ -0,0 +1,17 @@
[[!comment format=mdwn
username="joey"
subject="""comment 3"""
date="2019-11-13T19:07:49Z"
content="""
Tried going back to c04b2af3e1a8316e7cf640046ad0aa68826650ed,
which is before the separation of perform and cleanup stages.
The same code was in onlyActionOn back then. And the test case does not
crash.
So, that gives a good commit to start a bisection. Which will probably
find the bug was introduced in the separation of perform and cleanup stages,
because that added a lot of STM complexity.
(Have to cherry-pick 018b5b81736a321f3eb9762a2afb7124e19dbdf9
onto those old commits to make them build with current libraries.)
"""]]