[torqueusers] nodes switching back to state down

Garrick Staples garrick at usc.edu
Tue Jan 17 15:41:51 MST 2006

On Tue, Jan 17, 2006 at 10:14:27AM +0100, Schulz, Henrik alleged:
> Thanks Garrick!
> Now some more information about this problem:
> 1. I compiled torque (mom and server) with --disable-rpp because without that option no node switched to state free.

That shouldn't have anything to do with it.  server<->mom communications
always use RPP.

> 2. The problem arises when a node loses its network connection. Then the node is of course down, but when the node is back or rebooted, it never switches to the state free. It only does so after pbs_server is restarted.
> 3. While the node is down in such a way, momctl says it has a connection to the server:
> Host: cn49/cn49   Version: 2.0.0p4
> Server[0]: hn254 (connection is active)
>   WARNING:  no hello/cluster-addrs messages received from server
>   Init Msgs Sent:         54149 hellos
>   Last Msg From Server:   237319 seconds (DeleteJob)
>   Last Msg To Server:     2 seconds
> Server[1]: hn253 (connection is active)
>   Last Msg From Server:   18505 seconds (CLUSTER_ADDRS)
>   Last Msg To Server:     2 seconds
> Server[2]: hn252 (connection is active)
>   WARNING:  no hello/cluster-addrs messages received from server
>   Init Msgs Sent:         54149 hellos
>   WARNING:  no messages received from server
>   Last Msg To Server:     2 seconds
> HomeDirectory:          /var/spool/torque/mom_priv
> MOM active:             593473 seconds
> LOGLEVEL:               0 (use SIGUSR1/SIGUSR2 to adjust)
> JobList:                NONE
> diagnostics complete

Actually, the "connection is active" message is wrong.  It always says
that.

Why do you have 3 servers configured?

The MOM is not getting any messages from the server.  That is bad.  Your
MOM is sending HELLO messages to your 3 servers and the servers aren't
getting back to MOM.
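
To spot this condition across many nodes, the momctl diagnostics can be scanned for servers that have no recent "Last Msg From Server" entry. What follows is only a rough sketch, assuming the output layout matches the 2.0.0p4 sample quoted above; the function name and the 300-second threshold are illustrative assumptions, not part of TORQUE.

```python
import re

# Sample copied from the momctl output quoted in this thread.
SAMPLE = """\
Host: cn49/cn49   Version: 2.0.0p4
Server[0]: hn254 (connection is active)
  WARNING:  no hello/cluster-addrs messages received from server
  Init Msgs Sent:         54149 hellos
  Last Msg From Server:   237319 seconds (DeleteJob)
  Last Msg To Server:     2 seconds
Server[1]: hn253 (connection is active)
  Last Msg From Server:   18505 seconds (CLUSTER_ADDRS)
  Last Msg To Server:     2 seconds
Server[2]: hn252 (connection is active)
  WARNING:  no hello/cluster-addrs messages received from server
  Init Msgs Sent:         54149 hellos
  WARNING:  no messages received from server
  Last Msg To Server:     2 seconds
"""


def servers_without_replies(momctl_output, max_age_seconds=300):
    """Return server names the MOM has not heard from recently.

    max_age_seconds is an arbitrary illustrative threshold (an assumption).
    A server with no "Last Msg From Server" line at all is always reported.
    """
    last_heard = {}   # server name -> seconds since last message, or None
    current = None
    for raw in momctl_output.splitlines():
        line = raw.strip()
        server = re.match(r"Server\[\d+\]:\s+(\S+)", line)
        if server:
            current = server.group(1)
            last_heard[current] = None
            continue
        heard = re.match(r"Last Msg From Server:\s+(\d+) seconds", line)
        if heard and current is not None:
            last_heard[current] = int(heard.group(1))
    return [name for name, age in last_heard.items()
            if age is None or age > max_age_seconds]
```

With the sample above, all three servers are reported: hn254 and hn253 because their last message is hours or days old, and hn252 because the MOM never received anything from it.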

Is there a problem with forward/reverse resolution with your nodes?
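
One way to check this is to resolve each node name to an address and then resolve that address back to a name, on both the server and the node. A minimal sketch using only Python's standard socket module follows; the function name and the loose short-name comparison are assumptions made for illustration, and the resolver arguments exist only so the check can be exercised without touching real DNS.

```python
import socket


def check_resolution(hostname,
                     forward=socket.gethostbyname,
                     reverse=socket.gethostbyaddr):
    """Return (ok, detail) for a forward-then-reverse lookup of hostname.

    The comparison is deliberately loose (an assumption): a match on the
    short name before the first dot counts as consistent.
    """
    try:
        address = forward(hostname)
    except OSError as exc:  # socket.gaierror and socket.herror are OSErrors
        return False, f"forward lookup of {hostname} failed: {exc}"
    try:
        primary, aliases, _addresses = reverse(address)
    except OSError as exc:
        return False, f"reverse lookup of {address} failed: {exc}"

    candidates = {primary.lower()} | {alias.lower() for alias in aliases}
    short_names = {name.split(".")[0] for name in candidates}
    wanted = hostname.lower()
    if wanted in candidates or wanted.split(".")[0] in short_names:
        return True, f"{hostname} -> {address} -> {primary}"
    return False, f"{hostname} -> {address} -> {primary} (name mismatch)"
```

Calling check_resolution("cn49") on the server, and check_resolution with each server name on the node, should return ok as True in both directions if resolution is consistent.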

> 4. The pbsserver name is set in /var/spool/torque/server.name and
> there is the variable $clienthost set in
> /var/spool/torque/mom_priv/config. Is there another variable needed?

You want the server name in "/var/spool/torque/server_name".  If you
have 1 server, that's all you need.  If you have multiple servers, then
list all of them with $pbsserver.  $clienthost is deprecated.

> 5. There are no errors in the log file of pbs_mom of the node.
> Is it necessary to start pbs_mom and pbs_server with a specification
> of the RPP port if --disable-rpp is configured? Or does it use a
> default port for the communication?

No, --disable-rpp only affects "Resource Monitor" requests used by
momctl and schedulers with old OpenPBS.  These messages are very rarely
used with modern TORQUE.

> Another point (maybe the same problem): qmgr -c "l n cnXX" shows
> not only the jobs running on the machine, but sometimes also a job that
> finished a couple of days ago. This does not affect running jobs,
> but it is a little bit strange.

That's because MOM and server aren't talking correctly.

> Henrik
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On behalf of Garrick Staples
> Sent: Thursday, January 12, 2006 19:41
> To: torqueusers at supercluster.org
> Subject: Re: [torqueusers] nodes switching back to state down
> On Thu, Jan 12, 2006 at 09:55:09AM +0100, Schulz, Henrik alleged:
> > Hi,
> > 
> > I recently installed TORQUE v2.0.0p4. Now I have the problem that some
> > nodes (not all) are switching back to state down after setting them to
> > free with qmgr. This happens after a very short time (1-2 minutes).
> > During this time one can submit short jobs and these jobs are executed.
> If you had to manually set a node state to free, then a problem already
> exists.  "down" is not a state you have direct control over.  Overriding
> it in qmgr is only temporary.
> What does 'momctl -d 0 -h cn49' say?  Is $pbsserver properly set in the
> MOM config?
> Check the MOM log for errors.
> Make sure MOM's status_update_time (see pbs_mom manpage) jibes with the
> server's node_check_rate (see pbs_server_attributes manpage.)
> > On the nodes the pbs_mom is running. Restarting pbs_mom or rebooting the
> > machine does not help. 
> > 
> > pbs_server log gives the following:
> > 
> > 01/12/2006 09:49:15;0004;PBS_Server;node;cn49;attributes set:  at
> > request of schulzh at ...
> > 01/12/2006 09:49:15;0004;PBS_Server;node;cn49;node cn49 state changed
> > from down to free
> > 01/12/2006 09:49:15;0004;PBS_Server;node;cn49;attributes set: state =
> > free
> > 01/12/2006 09:50:29;0004;PBS_Server;Svr;check_nodes;node cn49 not
> > detected in 58830 seconds, marking node down
> > 01/12/2006 09:50:29;0040;PBS_Server;Req;update_node_state;node cn49
> > marked down
> > 
> > What is the problem here?
> That tells you the server isn't getting status updates from cn49 within
> the node_check_rate limit.
> -- 
> Garrick Staples, Linux/HPCC Administrator
> University of Southern California

Garrick Staples, Linux/HPCC Administrator
University of Southern California