[torqueusers] A job can not exit

Weiguang Chen chenweiguang82 at gmail.com
Wed Sep 17 19:31:59 MDT 2008


Hi,
Three days ago the controller node (node1) of our cluster went down for an
unknown reason, and I had to restart it.
After the restart the queued jobs were still in the queue, and the running
jobs were still running.
However, a job that has since completed (its output files confirm this) still
appears in the queue.
Its state is shown as "E", and it has been stuck in that state since
yesterday.
When I tried to delete it with the command "qdel jobid", I got the error
message "*qdel: Request invalid for state of job MSG=invalid state for job -
EXITING 3583.node1*".
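
For reference, a minimal sketch for listing every job that is currently in
the "E" state. It assumes the default "qstat" table layout (two header lines,
then one job per line with the state in the fifth column); the function name
list_exiting_jobs is hypothetical and not part of TORQUE.

```shell
# Hypothetical helper: print the IDs of jobs whose state column is "E".
# Assumes the default qstat table: two header lines, then one job per line
# with the state in the fifth whitespace-separated column.
list_exiting_jobs() {
    awk 'NR > 2 && $5 == "E" { print $1 }'
}

# Usage (on a host where qstat is available):
#   qstat | list_exiting_jobs
```

The helper only reads text from standard input, so it can also be run against
a saved copy of the "qstat" output.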
The other problem is the output of the command "pbsnodes -a": half of the
cluster nodes are reported as "down,job-exclusive", but these nodes are not
actually down.
Changing the state of these nodes with qmgr ("set node nodeid state =
job-exclusive") had no effect, because jobs are still running on these nodes.
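
To compare what the server reports against the nodes that are actually
reachable, the affected nodes can be pulled out of the "pbsnodes -a" output.
The sketch below assumes the usual layout of that output (an unindented node
name followed by indented "attribute = value" lines); the function name
list_down_nodes is hypothetical.

```shell
# Hypothetical helper: print "<node> <state>" for every node whose state
# list contains "down" (for example "down,job-exclusive").
# Assumes pbsnodes -a prints an unindented node name followed by indented
# "attribute = value" lines, one of which is "state = ...".
list_down_nodes() {
    awk '
        /^[^ \t]/ { node = $1 }            # unindented line: remember node name
        /^[ \t]+state = / {                # indented state attribute of that node
            if ($3 ~ /(^|,)down(,|$)/) print node, $3
        }
    '
}

# Usage (on a host where pbsnodes is available):
#   pbsnodes -a | list_down_nodes
```

Each output line names one node together with its full state string, which
can then be checked by hand against the node itself.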
I think these two problems are related.
What can I do?
Thanks

-- 
Best Wishes
ChenWeiguang