[torqueusers] Jobs not terminating

Garrick Staples garrick at usc.edu
Fri Mar 31 19:30:45 MST 2006


On Fri, Mar 31, 2006 at 09:50:42AM -0500, Tom Combs alleged:
> Garrick, Thanks for your reply. I'm going to start over with my current
> status, thus no included responses....
> 
> Here's the problem:  I can submit jobs, they run with all the user-specified
> tasks being performed, but then hang at job termination.  The out and in
> files remain in the spool directory and qstat shows the job as running. The
> owner of the job has no processes running on the mom node. The repeated
> error in the mom_log is:
> 
> pbs_mom;Req;jobobit;No contact with server at hostaddr c000000a,
>   port 15000,jobid 0.cmt errno 111

Translating c000000a to a dotted-quad, that's 192.0.0.10.  So this isn't
the "localhost" problem described earlier.
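
For reference, the conversion can be sketched in a few lines of Python.
The function name is made up for illustration, and the sketch assumes the
hostaddr in the log is the IPv4 address printed as a 32-bit hex number in
network byte order:

```python
import socket
import struct

def hex_to_dotted_quad(hexaddr):
    """Convert a 32-bit hex IPv4 address (network byte order) to dotted-quad."""
    # Assumption: the value is a plain hex string such as "c000000a".
    packed = struct.pack("!I", int(hexaddr, 16))
    return socket.inet_ntoa(packed)

print(hex_to_dotted_quad("c000000a"))  # 192.0.0.10
```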

errno 111 is "connection refused".  Can the nodes reach port 15000 on
192.0.0.10?  Try running 'qstat' or 'qmgr' from the node.  FC4 has port
filtering rules by default.  'service iptables stop' on 192.0.0.10 might
fix this.
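
If qstat isn't handy on the node, a rough reachability check could look
like the Python sketch below.  probe_tcp is a hypothetical helper, not
part of TORQUE, and the errno numbering (111 = ECONNREFUSED) is the Linux
one:

```python
import errno
import os
import socket

def probe_tcp(host, port, timeout=5.0):
    """Return 0 if a TCP connection succeeds, otherwise the errno of the failure."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        # connect_ex reports connect() failures as an errno value
        # instead of raising an exception.
        return sock.connect_ex((host, port))
    finally:
        sock.close()

def describe(code):
    """Return a readable label for an errno value, e.g. for log output."""
    if code == 0:
        return "connected"
    return "%s (%s)" % (errno.errorcode.get(code, "unknown"), os.strerror(code))
```

Run on a node, describe(probe_tcp("192.0.0.10", 15000)) next to
describe(probe_tcp("192.0.0.10", 15001)) would show whether either port
accepts connections from there.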

 
However, on closer inspection, port 15000 looks wrong.  That should be
15001.  Is pbs_mom started with -S?  Run 'getent services pbs/tcp' on a
node; it should print nothing or 15001.
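
The same lookup can be sketched in Python, which goes through the C
library's services database much like getent does.  service_port is a
made-up helper name, and the messages are only illustrative:

```python
import socket

def service_port(name, proto="tcp"):
    """Return the port registered for a service name, or None if there is no entry."""
    try:
        return socket.getservbyname(name, proto)
    except OSError:
        # No matching entry in the services database.
        return None

port = service_port("pbs")
if port is None:
    print("no pbs/tcp entry")
elif port == 15001:
    print("pbs/tcp is 15001")
else:
    print("pbs/tcp is %d, which is worth a closer look" % port)
```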


> There are no unusual messages in the server_log. Following is other info, let
> me know if there is anything else that is needed. This 65 node cluster of P4
> computers has been running fine for 3 years. I had issues on the master node
> and ended up doing a fresh install going to fedora4 from RH9 on it. Now I
> have my problem.  Thanks for the help.   -Tom Combs
> 
> *** Version & build:
> torque-2.0.0p8
> configure --prefix=/usr/local/torque --set-server-home=/opt/torque --with-scp
>    --set-default-server=cmt --enable-docs
> Fedora Core 4 on master, Red Hat 9 on compute nodes

This was built separately on both OSes, right?  You did an FC4 build and
a RH9 build?  I don't know if that is a factor, but it seems like the
right thing to do.

Please verify that /opt/torque/server_name contains "cmt" on the server
and all nodes.  Please verify the nodes don't have $clienthost,
$pbsserverhost, $restricted, etc. in /opt/torque/mom_priv/config.
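
With 65 nodes, a small script may be easier than checking by hand.  The
Python sketch below uses the paths from this thread as defaults; the
function names are hypothetical, and the directive list covers only the
three named above, so extend it as needed:

```python
SUSPECT_DIRECTIVES = ("$clienthost", "$pbsserverhost", "$restricted")

def read_server_name(path="/opt/torque/server_name"):
    """Return the contents of the server_name file with whitespace stripped."""
    with open(path) as fh:
        return fh.read().strip()

def suspect_config_lines(path="/opt/torque/mom_priv/config"):
    """Return config lines that begin with one of the directives listed above."""
    hits = []
    with open(path) as fh:
        for line in fh:
            stripped = line.strip()
            # str.startswith accepts a tuple of prefixes.
            if stripped.startswith(SUSPECT_DIRECTIVES):
                hits.append(stripped)
    return hits
```

On each host, read_server_name() should return "cmt" and
suspect_config_lines() should return an empty list.  Both raise
FileNotFoundError if the file does not exist.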

Does /opt/torque/server_priv/torque.cfg have anything on the server
host?

 
> *** iptables and selinux off on both the master and mom node
> *** hostbased auth working for users both ways between the master and mom
> node.

iptables is off.  Ok.  The problem is the port number.

 
> *** momctl -d 3    from the node (node-2):
> [root@node-2 sbin]# ./momctl -d 3
> Host: node-2/node-2   Version: 2.0.0p8
> Server[0]: cmt (connection is active)
>  Init Msgs Received:     1 hellos/1 cluster-addrs
>  Init Msgs Sent:         42 hellos
>  Last Msg From Server:   8 seconds (StatusJob)
>  Last Msg To Server:     40 seconds
> PID:                    14719
> HomeDirectory:          /opt/torque/mom_priv
> MOM active:             59013 seconds
> Server Update Interval: 45 seconds
> LOGLEVEL:               0 (use SIGUSR1/SIGUSR2 to adjust)

You might want to bump the loglevel; you may see something more.


> Communication Model:    RPP
> TCP Timeout:            20 seconds
> NOTE:  no prolog configured
> Alarm Time:             0 of 10 seconds
> Trusted Client List:    192.0.0.10,192.0.0.12,127.0.0.1

Your client-list is still empty.  Did 'pbsnodes -r node-2' or restarting
pbs_server fix that?


> ****  netstat on master (abbreviated):
> [root@cmt init.d]# netstat -ntlp
> Active Internet connections (only servers)
> Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name
> tcp        0      0 0.0.0.0:15001               0.0.0.0:*                   LISTEN      16205/pbs_server
> tcp        0      0 146.201.237.110:15004       0.0.0.0:*                   LISTEN      16201/pbs_sched

pbs_server is on the right port.  Odd that MOMs are trying to use the
wrong port.

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California