[torqueusers] torque 4.2.6.1 & 4.2.7: job dies after restarting pbs_mom

David Beer dbeer at adaptivecomputing.com
Tue Mar 25 17:02:05 MDT 2014


The port that pbs_demux opens is assigned dynamically, so it will vary from
job to job, but it should never be 0. Each node is told the port by the
remote mom that opened it. Checking the syslog and the mom logs on the other
moms in this job might hold some clues.
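
To make that check quicker across several nodes, here is a minimal sketch. It is not part of TORQUE: the function name and the regular expression are my own assumptions, modeled only on the syslog lines quoted below. It counts which address:port pairs open_demux reported, so a port of 0 stands out.

```python
import re
from collections import Counter

# Assumed pattern, based on the syslog lines quoted in this thread, e.g.
#   "... open_demux: cannot connect to 192.168.100.132:0"
#   "... open_demux: connect 192.168.100.132:0"
DEMUX_RE = re.compile(
    r"open_demux: (?:cannot connect to|connect) (\d+\.\d+\.\d+\.\d+):(\d+)"
)


def demux_targets(lines):
    """Count the (address, port) pairs named in open_demux error lines."""
    counts = Counter()
    for line in lines:
        match = DEMUX_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Two sample lines in the format shown in /var/log/messages below.
sample = [
    "Mar 17 13:39:16 node31 pbs_mom: LOG_ERROR::Connection refused (111) in "
    "open_demux, open_demux: cannot connect to 192.168.100.132:0",
    "Mar 17 13:39:20 node31 pbs_mom: LOG_ERROR::Inappropriate ioctl for device "
    "(25) in open_demux, open_demux: connect 192.168.100.132:0",
]

for (addr, port), n in sorted(demux_targets(sample).items()):
    note = " (port 0: the demux port was never passed along)" if port == 0 else ""
    print(f"{addr}:{port} seen {n} time(s){note}")
```

To use it on a real node, pass the lines of that node's syslog file to demux_targets() instead of the sample list.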


On Tue, Mar 18, 2014 at 2:45 AM, Thomas Dargel <td at chemie.hu-berlin.de> wrote:

> There is no firewall active -- as long as pbs_mom is not restarted on a
> joined sister node, the 'mpirun ....' jobs listed in the job script are
> executed properly.
>
> I'm not sure, but is this the port where open_demux tries to connect to???
>
>                                                 port: 0 ?????
>                                                   \/
> ...open_demux: cannot connect to 192.168.100.132:0
>
> Regards
>
>   Thomas.
>
>  >I'm curious about this message:
>  >
>  >Mar 17 13:39:16 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
>  >open_demux, open_demux: cannot connect to 192.168.100.132:0
>  >
>  >It looks like it is unable to open a port to this address. Does the
>  >address look legitimate? Is it possible there's a firewall or some other
>  >setting preventing this connection from being successful? This error is
>  >causing jobs to fail to start - parallel jobs won't run without the demux.
>  >
>  >
>  >On Mon, Mar 17, 2014 at 7:02 AM, Thomas Dargel <td at chemie.hu-berlin.de> wrote:
>  >
>  >> Hi All,
>  >>
>  >>   I observed this behavior of torque after restarting a pbs_mom on a
>  >> joined sister node of a parallel job: the job is killed, and I found
>  >> these messages:
>  >>
>  >> job.e496:
>  >> [node32:06220] plm:tm: failed to spawn daemon, error code = 17000
>  >>
>  >> ../../mom_logs/20140317:
>  >> 03/17/2014 13:39:00;0008;   pbs_mom.6359;Svr;task_save;saving task in
>  >> /var/spool/torque/4261/mom_priv/jobs/496.borneo.TK
>  >> 03/17/2014 13:39:00;0002;   pbs_mom.6370;n/a;mom_close_poll;entered
>  >> 03/17/2014 13:39:20;0001;   pbs_mom.6359;Job;496.borneo;task not started,
>  >> 'orted', stdio setup failed (see syslog)
>  >> 03/17/2014 13:39:20;0008;   pbs_mom.6359;Job;496.borneo;ERROR: received
>  >> request 'SPAWN_TASK' from 192.168.100.132:822 for job '496.borneo'
>  >> (cannot start task)
>  >>
>  >> /var/log/messages:
>  >> Mar 17 13:39:00 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
>  >> open_demux, open_demux: cannot connect to 192.168.100.132:0
>  >> Mar 17 13:39:02 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
>  >> open_demux, open_demux: cannot connect to 192.168.100.132:0
>  >> Mar 17 13:39:04 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
>  >> open_demux, open_demux: cannot connect to 192.168.100.132:0
>  >> Mar 17 13:39:06 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
>  >> open_demux, open_demux: cannot connect to 192.168.100.132:0
>  >> Mar 17 13:39:08 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
>  >> open_demux, open_demux: cannot connect to 192.168.100.132:0
>  >> Mar 17 13:39:10 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
>  >> open_demux, open_demux: cannot connect to 192.168.100.132:0
>  >> Mar 17 13:39:12 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
>  >> open_demux, open_demux: cannot connect to 192.168.100.132:0
>  >> Mar 17 13:39:14 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
>  >> open_demux, open_demux: cannot connect to 192.168.100.132:0
>  >> Mar 17 13:39:16 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
>  >> open_demux, open_demux: cannot connect to 192.168.100.132:0
>  >> Mar 17 13:39:18 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
>  >> open_demux, open_demux: cannot connect to 192.168.100.132:0
>  >> Mar 17 13:39:20 node31 pbs_mom: LOG_ERROR::Inappropriate ioctl for device (25) in
>  >> open_demux, open_demux: connect 192.168.100.132:0
>  >> Mar 17 13:39:20 node31 pbs_mom: LOG_ERROR::Inappropriate ioctl for device (25) in
>  >> start_process, cannot open mux stdout port
>  >>
>  >> It occurs when several job invocations are listed in the job script: the
>  >> job that is running during the restart of pbs_mom finishes fine, and the
>  >> error occurs during the start-up of the next parallel job in the job
>  >> script.
>  >>
>  >> Torque: 4.2.6.1 and 4.2.7
>  >> Maui: 3.3.1
>  >> OS: SLES11SP3, 3.0.101-0.15
>  >>
>  >> The job is not interfered with if only the pbs_mom on the master node of
>  >> this job is restarted.
>  >>
>  >> Since I'm very new to the 4.X.X versions of torque, I'm not sure if this
>  >> is normal behavior. If not, could someone give me a hint on how I can
>  >> overcome this?
>  >>
>  >> Thank you in advance,
>  >> regards
>  >>
>  >>   Thomas.
>  >>
>  >> _______________________________________________
>  >> torqueusers mailing list
>  >> torqueusers at supercluster.org
>  >> http://www.supercluster.org/mailman/listinfo/torqueusers
>  >>
>
>  >
>
>  >--
>  >David Beer | Senior Software Engineer
>  >Adaptive Computing
>
>
>



-- 
David Beer | Senior Software Engineer
Adaptive Computing

