[torqueusers] torque 4.2.6.1 & 4.2.7: job dies after restarting pbs_mom

David Beer dbeer at adaptivecomputing.com
Mon Mar 17 14:33:15 MDT 2014


I'm curious about this message:

Mar 17 13:39:16 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
open_demux, open_demux: cannot connect to 192.168.100.132:0

It looks like pbs_mom on node31 is unable to connect to this address. Does
the address look legitimate? Is it possible there's a firewall or some other
setting preventing this connection from succeeding? This error is causing
jobs to fail to start - parallel jobs won't run without the demux.
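
One way to check the firewall question is a plain TCP connect test run from
node31. This is only a minimal sketch, assuming Python is available on the
node: probe_tcp is a hypothetical helper (not part of TORQUE), and the port in
the usage comment is an assumption, so substitute one you expect to be
listening on 192.168.100.132.

```python
import errno
import socket


def probe_tcp(host, port, timeout=2.0):
    """Attempt one TCP connect and return a short status string."""
    try:
        # create_connection resolves the host and performs the connect().
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Same errno (111, ECONNREFUSED) that pbs_mom logged.
        return "refused"
    except socket.timeout:
        # No answer at all; packets may be silently dropped on the way.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. EHOSTUNREACH or ENETUNREACH.
        return "error: " + errno.errorcode.get(exc.errno, str(exc))


# Usage (port is an assumption; pick one you expect to be listening):
# print(probe_tcp("192.168.100.132", 15002))
```

A "refused" result means the host answered but nothing accepted the
connection on that port, while "timeout" is more typical of a firewall that
drops packets.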


On Mon, Mar 17, 2014 at 7:02 AM, Thomas Dargel <td at chemie.hu-berlin.de> wrote:

> Hi All,
>
>   I observed this behavior of torque after restarting pbs_mom on a sister
> node of a running parallel job: the job is killed, and I found these messages:
>
> job.e496:
> [node32:06220] plm:tm: failed to spawn daemon, error code = 17000
>
> ../../mom_logs/20140317:
> 03/17/2014 13:39:00;0008;   pbs_mom.6359;Svr;task_save;saving task in
> /var/spool/torque/4261/mom_priv/jobs/496.borneo.TK
> 03/17/2014 13:39:00;0002;   pbs_mom.6370;n/a;mom_close_poll;entered
> 03/17/2014 13:39:20;0001;   pbs_mom.6359;Job;496.borneo;task not started,
> 'orted', stdio setup failed (see syslog)
> 03/17/2014 13:39:20;0008;   pbs_mom.6359;Job;496.borneo;ERROR:    received
> request 'SPAWN_TASK' from 192.168.100.132:822 for job '496.borneo'
> (cannot start task)
>
> /var/log/messages:
> Mar 17 13:39:00 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
> open_demux, open_demux: cannot connect to 192.168.100.132:0
> Mar 17 13:39:02 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
> open_demux, open_demux: cannot connect to 192.168.100.132:0
> Mar 17 13:39:04 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
> open_demux, open_demux: cannot connect to 192.168.100.132:0
> Mar 17 13:39:06 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
> open_demux, open_demux: cannot connect to 192.168.100.132:0
> Mar 17 13:39:08 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
> open_demux, open_demux: cannot connect to 192.168.100.132:0
> Mar 17 13:39:10 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
> open_demux, open_demux: cannot connect to 192.168.100.132:0
> Mar 17 13:39:12 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
> open_demux, open_demux: cannot connect to 192.168.100.132:0
> Mar 17 13:39:14 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
> open_demux, open_demux: cannot connect to 192.168.100.132:0
> Mar 17 13:39:16 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
> open_demux, open_demux: cannot connect to 192.168.100.132:0
> Mar 17 13:39:18 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
> open_demux, open_demux: cannot connect to 192.168.100.132:0
> Mar 17 13:39:20 node31 pbs_mom: LOG_ERROR::Inappropriate ioctl for device
> (25) in open_demux, open_demux: connect 192.168.100.132:0
> Mar 17 13:39:20 node31 pbs_mom: LOG_ERROR::Inappropriate ioctl for device
> (25) in start_process, cannot open mux stdout port
>
> It occurs if several job invocations are listed in the job script. The job
> that is running while pbs_mom is restarted finishes fine; the error occurs
> during the start-up of the next parallel job in the job script.
>
> Torque: 4.2.6.1 and 4.2.7
> Maui: 3.3.1
> OS: SLES11SP3, 3.0.101-0.15
>
> The job is not affected if only the pbs_mom on the master node of this
> job is restarted.
>
> Since I'm very new to the 4.X.X version of torque, I'm not sure if this is
> normal behavior. If not, could someone give me a hint on how I can work
> around this?
>
> Thank you in advance,
> regards
>
>   Thomas.
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>



-- 
David Beer | Senior Software Engineer
Adaptive Computing