[torqueusers] interactive qsub failure
mej at lbl.gov
Fri Apr 27 17:01:50 MDT 2012
On Friday, 27 April 2012, at 14:28:14 (-0700),
Kenneth Yoshimoto wrote:
> I'm seeing an intermittent failure with qsub -I.
> The message in /var/log/messages is:
> Apr 27 14:07:27 gcn-17-71 pbs_mom: LOG_ERROR::Operation now in progress (115) in TMomFinalizeChild, cannot open interactive qsub socket to host gordon-ln4.local:50620 - 'cannot connect to port 1023 in client_to_svr - connection refused' - check routing tables/multi-homed host issues
> I think my routing is okay, as I can telnet to the login node
> port from the compute node. I also see some packet exchange to
> the port with tcpdump. Could the mom be attempting the connection
> before qsub starts listening? I would have thought qsub would
> start listening before sending the job to pbs_server. Any ideas
> on what might cause this?
Are you by any chance seeing a SYN, a SYN/ACK, and a RST?
If so, try setting $max_conn_timeout_micro_sec to 500000 in your
pbs_mom config and see if that helps.
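
As a rough sketch of both steps (the interface choice and the config
path are assumptions about a typical Linux install, so adjust them for
your site): a capture on the compute node limited to SYN and RST
packets to or from the login node should show whether that pattern
occurs. A SYN/ACK also has the SYN bit set, so this filter matches it.

    # run as root on the compute node while submitting qsub -I jobs
    tcpdump -nn -i any 'host gordon-ln4.local and tcp[tcpflags] & (tcp-syn|tcp-rst) != 0'

The setting itself is a single line in the mom config file, commonly
/var/spool/torque/mom_priv/config. The value is in microseconds, so
500000 is half a second:

    $max_conn_timeout_micro_sec 500000

pbs_mom has to re-read its config before the change takes effect, so
restart it or send it a SIGHUP on the affected nodes.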
Michael Jennings <mej at lbl.gov>
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E W: 510-495-2687
MS 050B-3209 F: 510-486-8615