[torqueusers] Jobs not terminating

Garrick Staples garrick at usc.edu
Thu Mar 30 15:25:52 MST 2006


On Thu, Mar 30, 2006 at 08:57:31AM -0500, Tom Combs alleged:
>  My hosts file looks to be correct so this is not the issue. Here is a 
> sample:
> # Do not remove the following line, or various programs
> # that require network functionality will fail.
> 
> 127.0.0.1       localhost
> 192.0.0.10      cmt node-0       # server_name is cmt - this comment is not part of hosts file....
> 192.0.0.11      node-1
> 192.0.0.12      node-2
> 192.0.0.13      node-3
> 
> Here is a momctl from one of the nodes:
> [root@node-55 sbin]# ./momctl -d 3
> 
> Host: node-55/node-55   Version: 2.0.0p8
> Server[0]: cmt (connection is active)
>  WARNING:  no hello/cluster-addrs messages received from server

That's bad.  It means this node never received the address list from
the server.


>  Init Msgs Sent:         7053 hellos
>  Last Msg From Server:   69831 seconds (StatusJob)
>  Last Msg To Server:     7 seconds
> PID:                    4228
> HomeDirectory:          /opt/torque/mom_priv
> MOM active:             70632 seconds
> Server Update Interval: 45 seconds
> LOGLEVEL:               0 (use SIGUSR1/SIGUSR2 to adjust)
> Communication Model:    RPP
> TCP Timeout:            20 seconds
> NOTE:  no prolog configured
> Alarm Time:             0 of 10 seconds
> Trusted Client List:    192.0.0.10,192.0.0.65,127.0.0.1

Yup!  Your addr list is missing the other nodes.

Does 'pbsnodes -r node-55' or restarting pbs_server fix this?
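
If you want to see exactly which nodes are missing, something like the
following rough sketch can compare the "Trusted Client List" line against
the hosts file.  It is only an illustration: the function names are made
up for this example, it assumes the momctl output looks like what you
pasted above, and it assumes every node in the hosts file ought to appear
in the list.

```python
# Sketch only: helper names are hypothetical, and the parsing assumes the
# "momctl -d 3" output format quoted in this thread.

HOSTS = """\
127.0.0.1       localhost
192.0.0.10      cmt node-0
192.0.0.11      node-1
192.0.0.12      node-2
192.0.0.13      node-3
"""

MOMCTL = """\
Server[0]: cmt (connection is active)
Trusted Client List:    192.0.0.10,192.0.0.65,127.0.0.1
"""


def trusted_clients(momctl_output):
    """Return the set of addresses on the 'Trusted Client List' line."""
    for line in momctl_output.splitlines():
        line = line.strip()
        if line.startswith("Trusted Client List:"):
            addrs = line.split(":", 1)[1]
            return {a.strip() for a in addrs.split(",") if a.strip()}
    return set()


def hosts_entries(hosts_text):
    """Map each IP to its first hostname, ignoring comments and blank lines."""
    entries = {}
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()
        fields = line.split()
        if len(fields) >= 2:
            entries[fields[0]] = fields[1]
    return entries


def missing_from_trusted(momctl_output, hosts_text):
    """Return {ip: hostname} for hosts entries absent from the trusted list."""
    trusted = trusted_clients(momctl_output)
    return {ip: name
            for ip, name in hosts_entries(hosts_text).items()
            if ip not in trusted}


if __name__ == "__main__":
    for ip, name in sorted(missing_from_trusted(MOMCTL, HOSTS).items()):
        print(ip, name)
```

With the sample values from your mail this reports 192.0.0.11, 192.0.0.12
and 192.0.0.13 (node-1 through node-3) as absent from the list.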

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California