[torqueusers] Mom Terminates Job before it is run

Adam Emerich aemerich at us.ibm.com
Wed Jul 11 12:07:43 MDT 2007


torqueusers-bounces at supercluster.org wrote on 07/10/2007 01:49:28 PM:

> On Tue, Jul 10, 2007 at 10:03:16AM -0500, Adam Emerich alleged:
> > Jul 10 09:30:37 n01-001-0 pbs_mom: Success (0) in TMomFinalizeChild, cannot open qsub sock
>
> For interactive jobs, qsub creates and listens on a socket, and pbs_mom
> must be able to connect to that socket.  If pbs_mom can't connect to
> qsub, then job launch is a failure.

In our case the management node's hostname is mn, so the TORQUE request
arrives at the mom with mn as the hostname.  The mom can communicate with
the management node, but only by using mnc as the hostname for it.  Does
the socket that qsub opens restrict where a connection may come from for
the connection to succeed?  In our case, outgoing traffic from the
management node leaves on IP 172.16.0.1, but when the compute nodes
communicate back to the management node they use 172.15.0.1.
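
To narrow this down, a check like the following could be run on a compute
node.  It is only a minimal sketch, not anything TORQUE provides: the helper
name can_connect is hypothetical, and the port to test is an assumption,
since qsub chooses its own listening port and that value is not shown in
the log above.  The helper resolves a hostname and attempts a plain TCP
connection, which is roughly what pbs_mom has to manage when it connects
back to qsub.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Resolve host, then try a TCP connect to host:port.

    Returns (ok, detail). Hypothetical diagnostic helper, not part of TORQUE.
    """
    try:
        # Step 1: name resolution, as seen from the node running this script.
        addr = socket.gethostbyname(host)
    except OSError as exc:
        return False, f"cannot resolve {host}: {exc}"
    try:
        # Step 2: TCP connect to the resolved address.
        with socket.create_connection((addr, port), timeout=timeout):
            return True, f"{host} -> {addr}:{port} reachable"
    except OSError as exc:
        return False, f"{host} -> {addr}:{port} failed: {exc}"
```

Calling can_connect("mn", port) and can_connect("mnc", port) from a compute
node, with port set to a port known to be listening on the management node,
would show whether the name mn resolves there at all and whether the
address it resolves to is reachable, compared with mnc.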

>
> --
> Garrick Staples, GNU/Linux HPCC SysAdmin
> University of Southern California
>
> Please avoid sending me Word or PowerPoint attachments.
> See http://www.gnu.org/philosophy/no-word-attachments.html


