[torqueusers] torque 4.2.6.1 & 4.2.7: job dies after restarting pbs_mom

Thomas Dargel td at chemie.hu-berlin.de
Wed Mar 26 11:21:31 MDT 2014


I couldn't find any clues to this in either mom_logs/20140326 or the messages
file.

What I did find is this:
if no job is running on the nodes, 'momctl -d 3 -h node33' shows a "Trusted
Client List" like this:
...,192.168.100.130:15003,192.168.100.131:15003,192.168.100.132:15003,\
192.168.100.133:15003,192.168.100.134:15003,....

If a job is executed on node33 (192.168.100.133) and node34 (192.168.100.134),
the list changes to:

...,192.168.100.130:15003,192.168.100.131:15003,192.168.100.132:15003,\
192.168.100.133:0,192.168.100.133:15003,192.168.100.134:0,192.168.100.134:15003,...

Where do these port numbers "0" come from?
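
To spot such entries quickly on other nodes, something like the following
minimal Python sketch could be used. It assumes the "Trusted Client List" has
already been copied out of the momctl output as a comma-separated string of
host:port entries, as shown above; the function name find_zero_port_entries
is made up for this illustration and is not part of torque.

```python
def find_zero_port_entries(trusted_list):
    """Return the host:port entries whose port is 0."""
    entries = []
    # Join lines that were continued with a trailing backslash, then split.
    for raw in trusted_list.replace("\\\n", "").split(","):
        entry = raw.strip()
        if ":" not in entry:
            continue  # skip empty or elided parts such as "..."
        host, _, port = entry.rpartition(":")
        if port == "0":
            entries.append(entry)
    return entries


# Sample taken from the list shown above while a job runs on node33/node34.
sample = (
    "192.168.100.132:15003,"
    "192.168.100.133:0,192.168.100.133:15003,"
    "192.168.100.134:0,192.168.100.134:15003"
)
zero_ports = find_zero_port_entries(sample)
print(zero_ports)
```

For the sample above, this reports only the two ":0" entries for node33 and
node34.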

Regards,

  Thomas.


On 26.03.2014 00:02, David Beer wrote:
> The port that is opened by pbs_demux is dynamic. Port 0 seems like it should be
> impossible though. Each node is told the port by the remote mom that opens it.
> Checking the syslog/mom log on the other moms in this job might hold some clues.
>
>
> On Tue, Mar 18, 2014 at 2:45 AM, Thomas Dargel <td at chemie.hu-berlin.de> wrote:
>
>     There is no firewall active. Without restarting pbs_mom on a joined sister
>     node, the 'mpirun ....' jobs listed in the job script are executed
>     properly.
>
>     I'm not sure, but is this the port that open_demux tries to connect to?
>
>                                                      port: 0 ?????
>                                                        \/
>     ...open_demux: cannot connect to 192.168.100.132:0
>
>     Regards
>
>        Thomas.
>
>       >I'm curious about this message:
>       >
>       >Mar 17 13:39:16 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
>       >open_demux, open_demux: cannot connect to 192.168.100.132:0
>       >
>       >It looks like it is unable to open a port to this address. Does the address
>       >look legitimate? Is it possible there's a firewall or some other setting
>       >preventing this connection from being successful? This error is causing
>       >jobs to fail to start - parallel jobs won't run without the demux.
>       >
>       >
>       >On Mon, Mar 17, 2014 at 7:02 AM, Thomas Dargel <td at chemie.hu-berlin.de> wrote:
>       >
>       >> Hi All,
>       >>
>       >>   I observed the following behavior of torque after restarting a pbs_mom
>       >> on a joined sister node of a parallel job: the job is killed, and I
>       >> found these messages:
>       >>
>       >> job.e496:
>       >> [node32:06220] plm:tm: failed to spawn daemon, error code = 17000
>       >>
>       >> ../../mom_logs/20140317:
>       >> 03/17/2014 13:39:00;0008;   pbs_mom.6359;Svr;task_save;saving task in
>       >> /var/spool/torque/4261/mom_priv/jobs/496.borneo.TK
>       >> 03/17/2014 13:39:00;0002;   pbs_mom.6370;n/a;mom_close_poll;entered
>       >> 03/17/2014 13:39:20;0001;   pbs_mom.6359;Job;496.borneo;task not started,
>       >> 'orted', stdio setup failed (see syslog)
>       >> 03/17/2014 13:39:20;0008;   pbs_mom.6359;Job;496.borneo;ERROR:    received
>       >> request 'SPAWN_TASK' from 192.168.100.132:822 for job '496.borneo'
>       >> (cannot start task)
>       >>
>       >> /var/log/messages:
>       >> Mar 17 13:39:00 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
>       >> open_demux, open_demux: cannot connect to 192.168.100.132:0
>       >> Mar 17 13:39:02 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
>       >> open_demux, open_demux: cannot connect to 192.168.100.132:0
>       >> Mar 17 13:39:04 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
>       >> open_demux, open_demux: cannot connect to 192.168.100.132:0
>       >> Mar 17 13:39:06 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
>       >> open_demux, open_demux: cannot connect to 192.168.100.132:0
>       >> Mar 17 13:39:08 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
>       >> open_demux, open_demux: cannot connect to 192.168.100.132:0
>       >> Mar 17 13:39:10 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
>       >> open_demux, open_demux: cannot connect to 192.168.100.132:0
>       >> Mar 17 13:39:12 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
>       >> open_demux, open_demux: cannot connect to 192.168.100.132:0
>       >> Mar 17 13:39:14 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
>       >> open_demux, open_demux: cannot connect to 192.168.100.132:0
>       >> Mar 17 13:39:16 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
>       >> open_demux, open_demux: cannot connect to 192.168.100.132:0
>       >> Mar 17 13:39:18 node31 pbs_mom: LOG_ERROR::Connection refused (111) in
>       >> open_demux, open_demux: cannot connect to 192.168.100.132:0
>       >> Mar 17 13:39:20 node31 pbs_mom: LOG_ERROR::Inappropriate ioctl for device (25)
>       >> in open_demux, open_demux: connect 192.168.100.132:0
>       >> Mar 17 13:39:20 node31 pbs_mom: LOG_ERROR::Inappropriate ioctl for device (25)
>       >> in start_process, cannot open mux stdout port
>       >>
>       >> It occurs if several job invocations are listed in the job script. The
>       >> job that is running during the restart of pbs_mom finishes fine; the
>       >> error occurs during the start-up of the next parallel job in the job
>       >> script.
>       >>
>       >> Torque: 4.2.6.1 and 4.2.7
>       >> Maui: 3.3.1
>       >> OS: SLES11SP3, 3.0.101-0.15
>       >>
>       >> The job is not affected if only the pbs_mom on the master node of this
>       >> job is restarted.
>       >>
>       >> Since I'm very new to the 4.X.X version of torque, I'm not sure if this is
>       >> normal behavior. If not, could someone give me a hint on how to overcome
>       >> this?
>       >>
>       >> Thank you in advance,
>       >> regards
>       >>
>       >>   Thomas.
>       >>
>       >> _______________________________________________
>       >> torqueusers mailing list
>       >> torqueusers at supercluster.org
>       >> http://www.supercluster.org/mailman/listinfo/torqueusers
>       >>
>
>       >
>
>       >--
>       >David Beer | Senior Software Engineer
>       >Adaptive Computing
>
>
>
>
>
>
> --
> David Beer | Senior Software Engineer
> Adaptive Computing
>
>
>


