fix ssh warmup hang
Fix race condition in ssh warmup that caused git-annex to get stuck and never process some while when run with high levels of concurrency. So far, I've isolated the problem to processTranscript, which hangs reading output from ssh in this situation. I don't yet understand why processTranscript behaves that way. Since here we don't care about the ssh output, and only want to /dev/null it, changed to not use processTranscript, avoiding its problem. This commit was supported by the NSF-funded DataLad project.
This commit is contained in:
parent
7d83502329
commit
ac6f58d642
3 changed files with 32 additions and 3 deletions
|
@ -34,7 +34,6 @@ import Annex.Path
|
||||||
import Utility.Env
|
import Utility.Env
|
||||||
import Utility.FileSystemEncoding
|
import Utility.FileSystemEncoding
|
||||||
import Utility.Hash
|
import Utility.Hash
|
||||||
import Utility.Process.Transcript
|
|
||||||
import Types.CleanupActions
|
import Types.CleanupActions
|
||||||
import Types.Concurrency
|
import Types.Concurrency
|
||||||
import Git.Env
|
import Git.Env
|
||||||
|
@ -219,13 +218,17 @@ prepSocket socketfile gc sshhost sshparams = do
|
||||||
-- return True.
|
-- return True.
|
||||||
-- (Except there's an unlikely false positive where a forced
|
-- (Except there's an unlikely false positive where a forced
|
||||||
-- ssh command exits 255.)
|
-- ssh command exits 255.)
|
||||||
tryssh extraps = liftIO $ do
|
tryssh extraps = liftIO $ withNullHandle $ \nullh -> do
|
||||||
let p = proc "ssh" $ concat
|
let p = proc "ssh" $ concat
|
||||||
[ extraps
|
[ extraps
|
||||||
, toCommand sshparams
|
, toCommand sshparams
|
||||||
, [fromSshHost sshhost, "true"]
|
, [fromSshHost sshhost, "true"]
|
||||||
]
|
]
|
||||||
(_, exitcode) <- processTranscript'' p Nothing
|
(Nothing, Nothing, Nothing, pid) <- createProcess $ p
|
||||||
|
{ std_out = UseHandle nullh
|
||||||
|
, std_err = UseHandle nullh
|
||||||
|
}
|
||||||
|
exitcode <- waitForProcess pid
|
||||||
return $ case exitcode of
|
return $ case exitcode of
|
||||||
ExitFailure 255 -> False
|
ExitFailure 255 -> False
|
||||||
_ -> True
|
_ -> True
|
||||||
|
|
|
@ -33,6 +33,9 @@ git-annex (6.20180228) UNRELEASED; urgency=medium
|
||||||
the only copy of a file when it thought the tor remote had a copy.
|
the only copy of a file when it thought the tor remote had a copy.
|
||||||
* Better ssh connection warmup when using -J for concurrency.
|
* Better ssh connection warmup when using -J for concurrency.
|
||||||
Avoids ugly messages when forced ssh command is not git-annex-shell.
|
Avoids ugly messages when forced ssh command is not git-annex-shell.
|
||||||
|
* Fix race condition in ssh warmup that caused git-annex to get
|
||||||
|
stuck and never process some while when run with high levels of
|
||||||
|
concurrency.
|
||||||
* Note that Remote/Git.hs now contains AGPL licensed code,
|
* Note that Remote/Git.hs now contains AGPL licensed code,
|
||||||
thus the license of git-annex as a whole is AGPL. This was already
|
thus the license of git-annex as a whole is AGPL. This was already
|
||||||
the case when git-annex was built with the webapp enabled.
|
the case when git-annex was built with the webapp enabled.
|
||||||
|
|
|
@ -29,3 +29,26 @@ Interestingly, the debug log shows it only ran git-annex-shell p2pstdio
|
||||||
6 times, despite the concurrency of 20. So, the other 14 must have stalled
|
6 times, despite the concurrency of 20. So, the other 14 must have stalled
|
||||||
setting up the connection. Suggests the bug is in the connection pool
|
setting up the connection. Suggests the bug is in the connection pool
|
||||||
code.
|
code.
|
||||||
|
|
||||||
|
> The hang does not involve the connection pool code itself; a call to
|
||||||
|
> Annex.Ssh.sshCommand is hanging. So, this likely affected git-annex
|
||||||
|
> before p2pstdio, although its timings of calls to sshCommand may be
|
||||||
|
> exposing the problem.
|
||||||
|
>
|
||||||
|
> The hang is in prepSocket; all the threads enter makeconnection near the
|
||||||
|
> same time, and so all of them try to warm up the ssh connection at the
|
||||||
|
> same time. And somehow many of those executions of ssh hang.
|
||||||
|
> (Arguably there should be locking to prevent multiple threads doing
|
||||||
|
> this, but the actual overhead of multiple threads doing it may be
|
||||||
|
> smaller than the overhead of such added locking.)
|
||||||
|
>
|
||||||
|
> Why is makeconnection's use of processTranscript hanging?
|
||||||
|
> processTranscriot tries to read the process's output (ssh has none),
|
||||||
|
> and waiting for the output to get read is for some reason hanging
|
||||||
|
> forever, despite the ssh process becoming a zombie.
|
||||||
|
> Converted makeconnection to not use processTranscript,
|
||||||
|
> and that does seem to avoid the hang.
|
||||||
|
>
|
||||||
|
> So, this bug is left open only because processTranscript hangs in situations
|
||||||
|
> like this. No other uses of it involve concurrency, but we still need to
|
||||||
|
> get to the bottom of its hang.. --[[Joey]]
|
||||||
|
|
Loading…
Reference in a new issue