[torqueusers] Jobs not terminating
Garrick Staples
garrick at usc.edu
Thu Mar 30 15:25:52 MST 2006
On Thu, Mar 30, 2006 at 08:57:31AM -0500, Tom Combs alleged:
> My hosts file looks to be correct so this is not the issue. Here is a
> sample:
> # Do not remove the following line, or various programs
> # that require network functionality will fail.
>
> 127.0.0.1 localhost
> 192.0.0.10 cmt node-0 # server_name is cmt - this comment is not part of hosts file....
> 192.0.0.11 node-1
> 192.0.0.12 node-2
> 192.0.0.13 node-3
>
> Here is a momctl from one of the nodes:
> [root@node-55 sbin]# ./momctl -d 3
>
> Host: node-55/node-55 Version: 2.0.0p8
> Server[0]: cmt (connection is active)
> WARNING: no hello/cluster-addrs messages received from server
That's bad. Without those messages, this node won't have received the
addr list from the server.
> Init Msgs Sent: 7053 hellos
> Last Msg From Server: 69831 seconds (StatusJob)
> Last Msg To Server: 7 seconds
> PID: 4228
> HomeDirectory: /opt/torque/mom_priv
> MOM active: 70632 seconds
> Server Update Interval: 45 seconds
> LOGLEVEL: 0 (use SIGUSR1/SIGUSR2 to adjust)
> Communication Model: RPP
> TCP Timeout: 20 seconds
> NOTE: no prolog configured
> Alarm Time: 0 of 10 seconds
> Trusted Client List: 192.0.0.10,192.0.0.65,127.0.0.1
Yup! Your addr list is missing the other nodes: it only has 192.0.0.10,
192.0.0.65, and 127.0.0.1.
Does running 'pbsnodes -r node-55' or restarting pbs_server fix this?
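
To run the same check on other nodes, one could compare the Trusted
Client List reported by 'momctl -d 3' against the hosts file. Here is a
minimal sketch in Python, assuming the output format quoted above
(comma-separated addresses on a single line) and an /etc/hosts-style
file. The function names are made up for illustration and are not part
of TORQUE.

```python
import re


def trusted_clients(momctl_output):
    """Return the set of addresses on the 'Trusted Client List' line.

    Assumes the format shown in the quoted momctl output:
    'Trusted Client List: a.b.c.d,e.f.g.h,...'
    """
    match = re.search(r"Trusted Client List:\s*(.*)", momctl_output)
    if not match:
        return set()
    return {addr.strip() for addr in match.group(1).split(",") if addr.strip()}


def hosts_ips(hosts_text):
    """Return the non-loopback IPs from /etc/hosts-style text."""
    ips = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        ip = line.split()[0]
        if not ip.startswith("127."):
            ips.add(ip)
    return ips


def missing_from_trusted(momctl_output, hosts_text):
    """List hosts-file IPs that the MOM does not have in its trusted list."""
    return sorted(hosts_ips(hosts_text) - trusted_clients(momctl_output))


def never_got_addrs(momctl_output):
    """True if momctl printed the 'no hello/cluster-addrs' warning."""
    return "no hello/cluster-addrs messages received" in momctl_output


# Sample data copied from the message above.
SAMPLE_MOMCTL = """\
Server[0]: cmt (connection is active)
WARNING: no hello/cluster-addrs messages received from server
Trusted Client List: 192.0.0.10,192.0.0.65,127.0.0.1
"""

SAMPLE_HOSTS = """\
127.0.0.1 localhost
192.0.0.10 cmt node-0
192.0.0.11 node-1
192.0.0.12 node-2
192.0.0.13 node-3
"""

if __name__ == "__main__":
    print("warning present:", never_got_addrs(SAMPLE_MOMCTL))
    print("missing:", missing_from_trusted(SAMPLE_MOMCTL, SAMPLE_HOSTS))
```

With the sample above, node-1 through node-3 would show up as missing,
which matches what the momctl output suggests.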
--
Garrick Staples, Linux/HPCC Administrator
University of Southern California