[torqueusers] nodes switching back to state down

Garrick Staples garrick at usc.edu
Thu Jan 12 11:40:58 MST 2006


On Thu, Jan 12, 2006 at 09:55:09AM +0100, Schulz, Henrik alleged:
> Hi,
> 
> I recently installed TORQUE v2.0.0p4. Now I have the problem that some
> nodes (not all) are switching back to state down after setting them to
> free with qmgr. This happens after a very short time (1-2 minutes).
> During this time one can submit short jobs and these jobs are executed.

If you had to manually set a node state to free, then a problem already
exists.  "down" is not a state you have direct control over.  Overriding
it in qmgr is only temporary.

What does 'momctl -d 0 -h cn49' say?  Is $pbsserver properly set in the
MOM config?

Check the MOM log for errors.

Make sure MOM's status_update_time (see pbs_mom manpage) jibes with the
server's node_check_rate (see pbs_server_attributes manpage).
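
As a rough illustration of that comparison, here is a minimal Python
sketch that reads "$name value" lines from MOM config text and flags a
status_update_time that is not below the server's node_check_rate. The
host name, the two interval values, and the helper names are
hypothetical; the rule "update interval should be shorter than the
check rate" is an assumption about what "jibes" means here, not
something taken from the TORQUE documentation.

```python
def parse_mom_config(text):
    """Return a dict of directive -> value from MOM config text."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Only "$name value" lines are treated as directives; blanks,
        # comments, and anything else are skipped.
        if not line.startswith("$"):
            continue
        parts = line[1:].split(None, 1)
        if len(parts) == 2:
            settings[parts[0]] = parts[1].strip()
    return settings


def check_intervals(mom_settings, node_check_rate):
    """Return a list of things worth a second look (empty if none)."""
    problems = []
    if "pbsserver" not in mom_settings:
        problems.append("$pbsserver is not set in this config")
    update = mom_settings.get("status_update_time")
    # Assumption: MOM should report more often than the server checks.
    if update is not None and int(update) >= node_check_rate:
        problems.append(
            "status_update_time (%s s) is not below node_check_rate (%d s)"
            % (update, node_check_rate)
        )
    return problems


# Hypothetical values, for illustration only.
sample_config = """\
# MOM config on cn49
$pbsserver headnode.example.org
$status_update_time 45
"""
sample_problems = check_intervals(parse_mom_config(sample_config),
                                  node_check_rate=150)
print(sample_problems or "no obvious mismatch")
```

The real values would come from the MOM config file on the node and
from the server attributes as reported by qmgr.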

 
> On the nodes the pbs_mom is running. Restarting pbs_mom or rebooting the
> machine does not help. 
> 
> pbs_server log gives the following:
> 
> 01/12/2006 09:49:15;0004;PBS_Server;node;cn49;attributes set:  at
> request of schulzh at ...
> 01/12/2006 09:49:15;0004;PBS_Server;node;cn49;node cn49 state changed
> from down to free
> 01/12/2006 09:49:15;0004;PBS_Server;node;cn49;attributes set: state =
> free
> 01/12/2006 09:50:29;0004;PBS_Server;Svr;check_nodes;node cn49 not
> detected in 58830 seconds, marking node down
> 01/12/2006 09:50:29;0040;PBS_Server;Req;update_node_state;node cn49
> marked down
> 
> What is the problem here?

That tells you the server isn't getting status updates from cn49 within
the node_check_rate limit.
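
To put the 58830 seconds in perspective, the following Python sketch
parses the check_nodes record quoted above (rejoined onto one line) and
works out how long the node has been silent. It assumes the log
timestamp is MM/DD/YYYY HH:MM:SS and that "not detected in N seconds"
counts from the last status update the server accepted; the function
name is made up for this example.

```python
import re
from datetime import datetime, timedelta

# Server log record as quoted in the message, rejoined onto one line.
record = ("01/12/2006 09:50:29;0004;PBS_Server;Svr;check_nodes;"
          "node cn49 not detected in 58830 seconds, marking node down")


def last_contact(line):
    """Return (node, silent_for, estimated_last_seen), or None if no match."""
    match = re.search(r"node (\S+) not detected in (\d+) seconds", line)
    if match is None:
        return None
    # The first ';'-separated field is the time the record was written.
    stamp = line.split(";", 1)[0]
    logged_at = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    silent_for = timedelta(seconds=int(match.group(2)))
    return match.group(1), silent_for, logged_at - silent_for


node, silent_for, last_seen = last_contact(record)
print(node, silent_for, last_seen)
```

Under those assumptions the gap is 16 hours, 20 minutes and 30 seconds,
which puts the last accepted update at about 17:30 the previous day,
well before the node was set to free by hand.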


-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California