[torqueusers] torque 4.2.6.1 & 4.2.7: job dies after restarting pbs_mom

Thomas Dargel td at chemie.hu-berlin.de
Tue Mar 18 02:45:38 MDT 2014


There is no firewall active. As long as pbs_mom is not restarted on a joined
sister node, the 'mpirun ....' jobs listed in the job script run properly.

I'm not sure, but is this the port that open_demux tries to connect to?

                                                port: 0 ?????
                                                  \/
...open_demux: cannot connect to 192.168.100.132:0
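
To see how often this happens, here is a minimal sketch in Python that pulls
the target host and port out of each open_demux message and flags the ones
aimed at port 0. The pattern is assumed from the syslog lines quoted below
only, not from the TORQUE source, and the function names are hypothetical.

```python
import re

# Layout assumed from the syslog samples in this thread, not from TORQUE itself.
DEMUX_RE = re.compile(
    r"open_demux: (?:cannot connect to|connect) "
    r"(?P<host>\d{1,3}(?:\.\d{1,3}){3}):(?P<port>\d+)"
)


def demux_targets(lines):
    """Yield (host, port) for every open_demux message found in lines."""
    for line in lines:
        match = DEMUX_RE.search(line)
        if match:
            yield match.group("host"), int(match.group("port"))


def port_zero_attempts(lines):
    """Return the open_demux targets whose port is 0."""
    return [(host, port) for host, port in demux_targets(lines) if port == 0]
```

Feeding it the lines of /var/log/messages (for example
port_zero_attempts(open("/var/log/messages"))) would list every attempt
that used port 0.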

Regards

  Thomas.

 >I'm curious about this message:
 >
 >Mar 17 13:39:16 node31 pbs_mom: LOG_ERROR::Connection refused (111) in open_demux, open_demux: cannot connect to 192.168.100.132:0
 >
 >It looks like it is unable to open a port to this address. Does the address
 >look legitimate? Is it possible there's a firewall or some other setting
 >preventing this connection from being successful? This error is causing
 >jobs to fail to start - parallel jobs won't run without the demux.
 >
 >
 >On Mon, Mar 17, 2014 at 7:02 AM, Thomas Dargel <td at chemie.hu-berlin.de> wrote:
 >
 >> Hi All,
 >>
 >>   I observed the following behavior of torque after restarting pbs_mom on a
 >> joined sister node of a parallel job: the job is killed, and I found these
 >> messages:
 >>
 >> job.e496:
 >> [node32:06220] plm:tm: failed to spawn daemon, error code = 17000
 >>
 >> ../../mom_logs/20140317:
 >> 03/17/2014 13:39:00;0008;   pbs_mom.6359;Svr;task_save;saving task in /var/spool/torque/4261/mom_priv/jobs/496.borneo.TK
 >> 03/17/2014 13:39:00;0002;   pbs_mom.6370;n/a;mom_close_poll;entered
 >> 03/17/2014 13:39:20;0001;   pbs_mom.6359;Job;496.borneo;task not started, 'orted', stdio setup failed (see syslog)
 >> 03/17/2014 13:39:20;0008;   pbs_mom.6359;Job;496.borneo;ERROR:    received request 'SPAWN_TASK' from 192.168.100.132:822 for job '496.borneo' (cannot start task)
 >>
 >> /var/log/messages:
 >> Mar 17 13:39:00 node31 pbs_mom: LOG_ERROR::Connection refused (111) in open_demux, open_demux: cannot connect to 192.168.100.132:0
 >> Mar 17 13:39:02 node31 pbs_mom: LOG_ERROR::Connection refused (111) in open_demux, open_demux: cannot connect to 192.168.100.132:0
 >> Mar 17 13:39:04 node31 pbs_mom: LOG_ERROR::Connection refused (111) in open_demux, open_demux: cannot connect to 192.168.100.132:0
 >> Mar 17 13:39:06 node31 pbs_mom: LOG_ERROR::Connection refused (111) in open_demux, open_demux: cannot connect to 192.168.100.132:0
 >> Mar 17 13:39:08 node31 pbs_mom: LOG_ERROR::Connection refused (111) in open_demux, open_demux: cannot connect to 192.168.100.132:0
 >> Mar 17 13:39:10 node31 pbs_mom: LOG_ERROR::Connection refused (111) in open_demux, open_demux: cannot connect to 192.168.100.132:0
 >> Mar 17 13:39:12 node31 pbs_mom: LOG_ERROR::Connection refused (111) in open_demux, open_demux: cannot connect to 192.168.100.132:0
 >> Mar 17 13:39:14 node31 pbs_mom: LOG_ERROR::Connection refused (111) in open_demux, open_demux: cannot connect to 192.168.100.132:0
 >> Mar 17 13:39:16 node31 pbs_mom: LOG_ERROR::Connection refused (111) in open_demux, open_demux: cannot connect to 192.168.100.132:0
 >> Mar 17 13:39:18 node31 pbs_mom: LOG_ERROR::Connection refused (111) in open_demux, open_demux: cannot connect to 192.168.100.132:0
 >> Mar 17 13:39:20 node31 pbs_mom: LOG_ERROR::Inappropriate ioctl for device (25) in open_demux, open_demux: connect 192.168.100.132:0
 >> Mar 17 13:39:20 node31 pbs_mom: LOG_ERROR::Inappropriate ioctl for device (25) in start_process, cannot open mux stdout port
 >>
 >> It occurs if several job invocations are listed in the job script: the
 >> invocation that is running while pbs_mom is restarted finishes fine, and
 >> the error occurs during the start-up of the next parallel invocation in
 >> the job script.
 >>
 >> Torque: 4.2.6.1 and 4.2.7
 >> Maui: 3.3.1
 >> OS: SLES11SP3, 3.0.101-0.15
 >>
 >> The job is not affected if only the pbs_mom on the master node of this
 >> job is restarted.
 >>
 >> Since I'm very new to the 4.X.X versions of torque, I'm not sure if this is
 >> normal behavior. If not, could someone give me a hint on how to overcome
 >> this?
 >>
 >> Thank you in advance,
 >> regards
 >>
 >>   Thomas.
 >>
 >--
 >David Beer | Senior Software Engineer
 >Adaptive Computing