[torqueusers] nodes switching back to state down
garrick at usc.edu
Thu Jan 12 11:40:58 MST 2006
On Thu, Jan 12, 2006 at 09:55:09AM +0100, Schulz, Henrik alleged:
> I recently installed TORQUE v2.0.0p4. Now I have the problem that some
> nodes (not all) are switching back to state down after setting them to
> free with qmgr. This happens after a very short time (1-2 minutes).
> During this time one can submit short jobs and these jobs are executed.
If you had to manually set a node state to free, then a problem already
exists. "down" is not a state you have direct control over. Overriding
it in qmgr is only temporary.
What does 'momctl -d 0 -h cn49' say? Is $pbsserver properly set in the
Check the MOM log for errors.
Make sure MOM's status_update_time (see the pbs_mom manpage) jibes with the
server's node_check_rate (see the pbs_server_attributes manpage).
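
To make "jibes" concrete: the MOM has to report more often than the server's timeout, or the server will keep marking the node down. A minimal sketch of that check follows. The function name and the numbers are illustrative assumptions, not values read from any real configuration.

```python
def updates_arrive_in_time(status_update_time, node_check_rate):
    """Return True if the MOM reports more often than the server's timeout.

    Both arguments are in seconds. This is a hypothetical helper for
    illustration, not part of TORQUE.
    """
    return status_update_time < node_check_rate


# Illustrative values only: MOM reports every 45 s, server waits 600 s.
print(updates_arrive_in_time(45, 600))
# A MOM that reports less often than the server checks will look dead.
print(updates_arrive_in_time(900, 600))
```

If the two settings are compatible and the node still goes down, the updates are probably not reaching the server at all, which points back at $pbsserver and the MOM log.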
> On the nodes the pbs_mom is running. Restarting pbs_mom or rebooting the
> machine does not help.
> pbs_server log gives the following:
> 01/12/2006 09:49:15;0004;PBS_Server;node;cn49;attributes set: at
> request of schulzh at ...
> 01/12/2006 09:49:15;0004;PBS_Server;node;cn49;node cn49 state changed
> from down to free
> 01/12/2006 09:49:15;0004;PBS_Server;node;cn49;attributes set: state =
> 01/12/2006 09:50:29;0004;PBS_Server;Svr;check_nodes;node cn49 not
> detected in 58830 seconds, marking node down
> 01/12/2006 09:50:29;0040;PBS_Server;Req;update_node_state;node cn49
> marked down
> What is the problem here?
That tells you the server isn't getting status updates from cn49 within
the node_check_rate limit.
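
If you want to see how often this happens and for which nodes, something like the following sketch could pull those events out of the server log. It assumes each log entry is on a single line in the format quoted above (the quoted lines are wrapped by the mail client); the function name is made up for this example.

```python
import re
from datetime import datetime

# Pattern taken from the check_nodes message quoted in this thread.
DOWN_RE = re.compile(r"node (\S+) not detected in (\d+) seconds, marking node down")


def find_down_events(lines):
    """Return (timestamp, node, seconds_since_last_update) for each match."""
    events = []
    for line in lines:
        match = DOWN_RE.search(line)
        if not match:
            continue
        # The first ';'-separated field is the timestamp, e.g. 01/12/2006 09:50:29
        stamp = datetime.strptime(line.split(";", 1)[0].strip(), "%m/%d/%Y %H:%M:%S")
        events.append((stamp, match.group(1), int(match.group(2))))
    return events


# Sample lines copied from the quoted pbs_server log, rejoined onto one line each.
sample = [
    "01/12/2006 09:49:15;0004;PBS_Server;node;cn49;node cn49 state changed from down to free",
    "01/12/2006 09:50:29;0004;PBS_Server;Svr;check_nodes;node cn49 not detected in 58830 seconds, marking node down",
]

for stamp, node, seconds in find_down_events(sample):
    print(f"{stamp} {node} silent for {seconds}s (~{seconds / 3600:.1f}h)")
```

Note that 58830 seconds is a bit over 16 hours, so the server had not heard from cn49 since long before the node was set to free by hand.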
Garrick Staples, Linux/HPCC Administrator
University of Southern California