Using `git annex get -J20` to get 1000 files from an ssh remote on localhost,
I've observed it hang three times.

    get 99 (from origin...) (checksum...) ok
    get 992 (from origin...) (checksum...) ok
    get 991 (from origin...) (checksum...) ok
    get 993 (from origin...) (checksum...) ok
    get 995 (from origin...) (checksum...) ok
    get 1000 (from origin...)
    get 1 (from origin...)
    get 10 (from origin...)
    get 108 (from origin...)
    get 105 (from origin...)
    [some more]

It seems it's trying to receive the content of the last files listed, but
has somehow hung in the P2P protocol and not received the data.
Those are the only files not present. --[[Joey]]

The particular set of files it stalls on seems somewhat deterministic;
the sets have been the same at least twice.

Looking at `--debug`, it does not seem to get to the point of sending a P2P
request for the keys of the files that it stalls on.
So it appears to be a bug in setting up the P2P ssh connection.

Interestingly, the debug log shows it only ran `git-annex-shell p2pstdio`
6 times, despite the concurrency of 20. So the other 14 must have stalled
setting up the connection. That suggests the bug is in the connection pool
code.

> The hang does not involve the connection pool code itself; a call to
> Annex.Ssh.sshCommand is hanging. So this likely affected git-annex
> before p2pstdio, although p2pstdio's timing of calls to sshCommand may
> be exposing the problem.
>
> The hang is in prepSocket; all the threads enter makeconnection near the
> same time, and so all of them try to warm up the ssh connection at the
> same time. And somehow many of those executions of ssh hang.
> (Arguably there should be locking to prevent multiple threads doing
> this, but the actual overhead of multiple threads doing it may be
> smaller than the overhead of such added locking.)
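>
> (Purely as an illustration of the locking being weighed above, and not
> anything git-annex actually does: a minimal sketch, assuming a per-host
> MVar kept alongside the connection pool, of serializing the warm-up so
> only one thread runs ssh at a time; warmupSerialized is a made-up name.)
>
>     import Control.Concurrent.MVar
>
>     -- Run the ssh warm-up action while holding the lock, so concurrent
>     -- threads take turns instead of all spawning ssh at once.
>     warmupSerialized :: MVar () -> IO a -> IO a
>     warmupSerialized lock warmup = withMVar lock (const warmup)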
>
> Converted makeconnection to not use processTranscript,
> and that does seem to avoid the hang.
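>
> (A hedged sketch of roughly what that change amounts to, not the actual
> patch: warm up the connection by running ssh with plain
> createProcess/waitForProcess and inherited stdio, so no pipe is set up
> to read a transcript from; warmupSsh is a made-up name.)
>
>     import System.Exit (ExitCode(..))
>     import System.Process
>
>     -- Run ssh once to warm up the connection cache and just wait for
>     -- it to exit, without capturing any of its output.
>     warmupSsh :: [String] -> IO Bool
>     warmupSsh ps = do
>         (_, _, _, pid) <- createProcess (proc "ssh" ps)
>         code <- waitForProcess pid
>         return (code == ExitSuccess)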
>
> Why is makeconnection's use of processTranscript hanging?
> processTranscript tries to read the process's output (ssh has none),
> and waiting for the output to get read is for some reason hanging
> forever, despite the ssh process becoming a zombie. I actually
> rewrote the part of processTranscript that reads the process's output
> to be a much simpler use of async, and that new implementation has the
> same problem. It's hanging, not throwing an exception. Most puzzling!
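>
> (For context, a simplified illustration of reading a process's output
> with the async library, in the spirit of the rewrite described above;
> readOutput is a made-up name, not the git-annex code. If another
> process silently holds the write end of the pipe open, the hGetContents
> below never sees EOF and blocks just as described.)
>
>     import Control.Concurrent.Async (withAsync, wait)
>     import Control.DeepSeq (force)
>     import Control.Exception (evaluate)
>     import System.Exit (ExitCode)
>     import System.IO (hGetContents)
>     import System.Process
>
>     -- Read all of the process's stdout in a helper thread, then wait
>     -- for the process itself to exit. The partial pattern match is
>     -- safe because std_out is set to CreatePipe.
>     readOutput :: CreateProcess -> IO (String, ExitCode)
>     readOutput p = do
>         (_, Just hout, _, pid) <- createProcess p { std_out = CreatePipe }
>         withAsync (hGetContents hout >>= evaluate . force) $ \reader -> do
>             out <- wait reader
>             code <- waitForProcess pid
>             return (out, code)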
>
> Hmm.. Perhaps the write handle is staying open even after ssh exits?
> processTranscript sets up a pipe for the process to write to, and
> the ssh process may inherit the FDs for that pipe (other than as stdout
> and stderr). If the write end remains open, reading from the pipe would
> block. Since the ssh mux server is involved, and the handle might be
> passed to it or something, that seems at least possible as the cause.
> The Windows version of createProcess does not use that pipe, and
> switching to use it does avoid the hang.
> Yep! Setting the pipe's FDs to close on exec did avoid the hang.
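>
> (A minimal sketch of that fix, assuming POSIX; createCloseOnExecPipe is
> a made-up name. Both fds of the pipe are marked close-on-exec, so a
> process forked later, such as the ssh mux server, cannot inherit the
> write end and keep the reading side from ever seeing EOF.)
>
>     import System.Posix.IO (FdOption(CloseOnExec), createPipe, setFdOption)
>     import System.Posix.Types (Fd)
>
>     -- Create a pipe whose fds are not inherited across exec.
>     createCloseOnExecPipe :: IO (Fd, Fd)
>     createCloseOnExecPipe = do
>         (readfd, writefd) <- createPipe
>         setFdOption readfd CloseOnExec True
>         setFdOption writefd CloseOnExec True
>         return (readfd, writefd)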
>
> [[done]] --[[Joey]]