[torqueusers] nodes switching back to state down

Schulz, Henrik H.Schulz at fz-rossendorf.de
Tue Jan 17 02:14:27 MST 2006


Thanks Garrick!

Now some more information about this problem:

1. I compiled torque (mom and server) with --disable-rpp, because without that option no node ever switched to the free state.

2. The problem arises when a node loses its network connection. The node is then of course down, but when it comes back or is rebooted, it never switches back to the free state; it only does so after pbs_server is restarted.

3. While the node is down in this way, momctl says it has a connection to the server:

Host: cn49/cn49   Version: 2.0.0p4
Server[0]: hn254 (connection is active)
  WARNING:  no hello/cluster-addrs messages received from server
  Init Msgs Sent:         54149 hellos
  Last Msg From Server:   237319 seconds (DeleteJob)
  Last Msg To Server:     2 seconds
Server[1]: hn253 (connection is active)
  Last Msg From Server:   18505 seconds (CLUSTER_ADDRS)
  Last Msg To Server:     2 seconds
Server[2]: hn252 (connection is active)
  WARNING:  no hello/cluster-addrs messages received from server
  Init Msgs Sent:         54149 hellos
  WARNING:  no messages received from server
  Last Msg To Server:     2 seconds
HomeDirectory:          /var/spool/torque/mom_priv
MOM active:             593473 seconds
LOGLEVEL:               0 (use SIGUSR1/SIGUSR2 to adjust)
JobList:                NONE

diagnostics complete

4. The pbs_server name is set in /var/spool/torque/server.name, and the variable $clienthost is set in /var/spool/torque/mom_priv/config. Is another variable needed?
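
Garrick's question below suggests $pbsserver. A minimal mom_priv/config sketch along those lines (hn254 taken from the momctl output above; the $logevent line is optional and only my assumption for debugging):

  # /var/spool/torque/mom_priv/config -- minimal sketch
  $clienthost  hn254    # host allowed to connect to this MOM
  $pbsserver   hn254    # host the MOM sends its status updates to
  $logevent    255      # optional: verbose logging while debugging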

5. There are no errors in the pbs_mom log file on the node.

Is it necessary to start pbs_mom and pbs_server with an explicit RPP port specification if --disable-rpp is configured, or do they use a default port for the communication?
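
For reference, the default ports I know of (normally registered in /etc/services) are:

  pbs         15001/tcp   # pbs_server
  pbs_mom     15002/tcp   # MOM service port
  pbs_resmom  15003/tcp   # MOM resource manager
  pbs_resmom  15003/udp   # RPP runs over UDP on the same port number
  pbs_sched   15004/tcp   # scheduler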

Another point (maybe the same problem): qmgr -c "l n cnXX" shows not only the jobs running on the machine, but sometimes also a job that finished a couple of days ago. This does not affect running jobs, but it is a bit strange.
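
If that is just a stale record, I would guess something like this clears it (the job id is invented; momctl -c is supposed to clear a stale job on the MOM, and qdel -p force-purges a job from the server):

  momctl -c 1234.hn254 -h cnXX   # clear the stale job record on the MOM
  qdel -p 1234.hn254             # force-purge the job from the server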


Henrik


-----Original Message-----
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Garrick Staples
Sent: Thursday, 12 January 2006 19:41
To: torqueusers at supercluster.org
Subject: Re: [torqueusers] nodes switching back to state down

On Thu, Jan 12, 2006 at 09:55:09AM +0100, Schulz, Henrik alleged:
> Hi,
> 
> I recently installed TORQUE v2.0.0p4. Now I have the problem that some
> nodes (not all) are switching back to state down after setting them to
> free with qmgr. This happens after a very short time (1-2 minutes).
> During this time one can submit short jobs and these jobs are executed.

If you had to manually set a node state to free, then a problem already
exists.  "down" is not a state you have direct control over.  Overriding
it in qmgr is only temporary.

What does 'momctl -d 0 -h cn49' say?  Is $pbsserver properly set in the
MOM config?

Check the MOM log for errors.

Make sure MOM's status_update_time (see the pbs_mom manpage) jives with the
server's node_check_rate (see the pbs_server_attributes manpage).
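
For example (values invented, just to show the relationship; node_check_rate
should be comfortably larger than status_update_time):

  # MOM side, in mom_priv/config:
  $status_update_time 45

  # server side, via qmgr:
  qmgr -c "set server node_check_rate = 150"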

 
> On the nodes the pbs_mom is running. Restarting pbs_mom or rebooting the
> machine does not help. 
> 
> pbs_server log gives the following:
> 
> 01/12/2006 09:49:15;0004;PBS_Server;node;cn49;attributes set:  at
> request of schulzh at ...
> 01/12/2006 09:49:15;0004;PBS_Server;node;cn49;node cn49 state changed
> from down to free
> 01/12/2006 09:49:15;0004;PBS_Server;node;cn49;attributes set: state =
> free
> 01/12/2006 09:50:29;0004;PBS_Server;Svr;check_nodes;node cn49 not
> detected in 58830 seconds, marking node down
> 01/12/2006 09:50:29;0040;PBS_Server;Req;update_node_state;node cn49
> marked down
> 
> What is the problem here?

That tells you the server isn't getting status updates from cn49 within
the node_check_rate limit.
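
A quick way to see whether updates are arriving at all: the rectime field in
the node's status output should advance every status_update_time seconds, e.g.:

  pbsnodes cn49 | grep rectime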


-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California

