Using git annex get -J20 of 1000 files from ssh remote on localhost,
I've thrice observed it to hang.

    get 99 (from origin...) (checksum...) ok
    get 992 (from origin...) (checksum...) ok
    get 991 (from origin...) (checksum...) ok
    get 993 (from origin...) (checksum...) ok
    get 995 (from origin...) (checksum...) ok
    get 1000 (from origin...)
    get 1 (from origin...)
    get 10 (from origin...)
    get 108 (from origin...)
    get 105 (from origin...)
    [some more]

It seems it's trying to receive content of the last files listed, but has
hung somehow in the P2P protocol and not received the data.
Those are the only files not present. --[[Joey]]

The particular set of files it stalls on seems somewhat deterministic;
the sets have been the same at least twice.

Looking at --debug, it does not seem to get to the point of sending a P2P
request for the keys of the files that it stalls on.

So, a bug setting up the P2P ssh connection, it seems.

Interestingly, the debug log shows it only ran git-annex-shell p2pstdio
6 times, despite the concurrency of 20. So, the other 14 must have stalled
setting up the connection, which suggests the bug is in the connection pool
code.

> The hang does not involve the connection pool code itself; a call to
> Annex.Ssh.sshCommand is hanging. So, this likely affected git-annex
> before p2pstdio, although p2pstdio's timing of calls to sshCommand may
> be what exposes the problem.
>
> The hang is in prepSocket; all the threads enter makeconnection near the
> same time, and so all of them try to warm up the ssh connection at the
> same time. And somehow many of those executions of ssh hang.
> (Arguably there should be locking to prevent multiple threads doing
> this, but the actual overhead of multiple threads doing it may be
> smaller than the overhead of such added locking.)
>
> Converted makeconnection to not use processTranscript,
> and that does seem to avoid the hang.
>
> Why is makeconnection's use of processTranscript hanging?
> processTranscript tries to read the process's output (ssh has none),
> and waiting for the output to get read is for some reason hanging
> forever, despite the ssh process becoming a zombie. I actually
> rewrote the part of processTranscript that reads the process's output
> to be a much simpler use of async (sketched below), and that new
> implementation has the same problem. It's hanging, not throwing an
> exception. Most puzzling!
>
> Hmm.. Perhaps the write handle is staying open even after ssh exits?
> processTranscript sets up a pipe for the process to write to, and
> the ssh process may inherit the FDs for that pipe (other than as stdout
> and stderr). If the write end remains open, reading from the pipe would
> block, never seeing EOF. Since the ssh mux server is involved, and the
> handle might be passed to it or something, that seems at least possible
> as the cause. The Windows version of createProcess does not use that
> pipe, and switching to use it does avoid the hang.
> Yep! Setting the pipe's FDs to close on exec did avoid the hang.
> (A small standalone demo of this failure mode is sketched below.)
>
> [[done]] --[[Joey]]
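>

For concreteness, here is a rough sketch of that kind of async-based reading. It is not the actual processTranscript code (which, among other differences, sends stdout and stderr to a single pipe); the transcript name and the two separate pipes here are just for illustration.

    import Control.Concurrent.Async (concurrently)
    import Control.Exception (evaluate)
    import System.Exit (ExitCode)
    import System.IO (hClose, hGetContents)
    import System.Process

    -- Run a command and collect its stdout and stderr while waiting for
    -- it to exit; each pipe is drained in its own thread via concurrently.
    transcript :: FilePath -> [String] -> IO (String, String, ExitCode)
    transcript cmd args = do
        (Just hin, Just hout, Just herr, pid) <- createProcess (proc cmd args)
            { std_in  = CreatePipe
            , std_out = CreatePipe
            , std_err = CreatePipe
            }
        hClose hin  -- nothing to feed to the process in this sketch
        let drain h = do
                s <- hGetContents h
                _ <- evaluate (length s)  -- force the read through to EOF
                return s
        -- These reads are where the hang showed up: they finish only once
        -- every copy of the pipes' write ends has been closed.
        (out, err) <- concurrently (drain hout) (drain herr)
        code <- waitForProcess pid
        return (out, err, code)

Even with a structure this simple, the reads hung: hGetContents never saw EOF, because something other than the exited ssh process still held the pipe's write end open.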
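
To show that failure mode, and the close-on-exec fix, in isolation: the following standalone sketch assumes a POSIX system with the unix package, and uses sleep as a stand-in for the long-lived ssh mux daemon that inherited the pipe's write FD. None of it is git-annex code.

    import Control.Exception (evaluate)
    import System.IO (hGetContents)
    import System.Posix.IO (FdOption(CloseOnExec), closeFd, createPipe,
        fdToHandle, setFdOption)
    import System.Posix.Process (executeFile, forkProcess)

    main :: IO ()
    main = do
        (readFd, writeFd) <- createPipe
        -- Comment out the next line to reproduce the hang: the exec'd
        -- process then inherits writeFd, holding the pipe's write end open
        -- for 60 seconds, so the hGetContents below blocks that long
        -- waiting for EOF.
        setFdOption writeFd CloseOnExec True
        -- A long-lived child, standing in for ssh's mux daemon.
        _ <- forkProcess $ executeFile "sleep" True ["60"] Nothing
        closeFd writeFd               -- the parent drops its own write end
        readh <- fdToHandle readFd
        out <- hGetContents readh
        _ <- evaluate (length out)    -- returns once the pipe reaches EOF
        putStrLn "pipe reached EOF; no hang"

With the setFdOption line removed, hGetContents blocks for the full 60 seconds even though the parent has closed its own write end, which is the same symptom seen with ssh. With it, no copy of the write FD survives the exec, so the read sees EOF as soon as the parent closes its end.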