[torqueusers] A job can not exit
chenweiguang82 at gmail.com
Wed Sep 17 19:31:59 MDT 2008
Three days ago, the controller node (node1) of our cluster went down for an
unknown reason, and I had to restart it.
The queued jobs were still held after the restart, and the running jobs were too.
But a job that has completed, which I can confirm from its output files, still
exists in the queue.
This job's state is marked "E", and it has been stuck in that state ever since.
When I tried to delete it with the qdel command, this error message was shown:
"*qdel: Request invalid for state of job MSG=invalid state for job - EXITING
3583.node1*"
The other problem is in the output of the command "pbsnodes -a": the state of
half of the cluster nodes is "down,job-exclusive", but these nodes are not
actually down.
Changing the state of these nodes with qmgr "set node nodeid state =
job-exclusive" had no effect, because jobs are still running on these nodes.
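
In case it helps to see which nodes are affected, a small script along these
lines could pick them out of the "pbsnodes -a" output. This is only a sketch:
it assumes the usual layout (node name at the start of a line, indented
"key = value" attributes below it), and the SAMPLE text, including the node
names, np values and job id, is made up for illustration and is not output
from our cluster. The function names are my own.

```python
def nodes_by_state(pbsnodes_text):
    """Map each node name to its list of state flags."""
    states = {}
    current = None
    for line in pbsnodes_text.splitlines():
        if not line.strip():
            # A blank line separates one node's record from the next.
            current = None
            continue
        if not line[0].isspace():
            # An unindented line is assumed to be a node name.
            current = line.strip()
            states[current] = []
        elif current is not None:
            key, sep, value = line.partition("=")
            if sep and key.strip() == "state":
                states[current] = [s.strip() for s in value.split(",")]
    return states


def suspicious_nodes(pbsnodes_text):
    """Names of nodes reported as both "down" and "job-exclusive"."""
    return sorted(
        name
        for name, flags in nodes_by_state(pbsnodes_text).items()
        if "down" in flags and "job-exclusive" in flags
    )


# Hypothetical sample in the assumed "pbsnodes -a" layout.
SAMPLE = """\
node2
     state = down,job-exclusive
     np = 4
     jobs = 0/1234.node1

node3
     state = free
     np = 4
"""

if __name__ == "__main__":
    print(suspicious_nodes(SAMPLE))
```

On a real cluster the text would come from running "pbsnodes -a" and passing
its output to suspicious_nodes() instead of SAMPLE.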
I think these two problems are related.
What can I do?