[torqueusers] node(s) not accepting jobs
Tony Schreiner
schreian at bc.edu
Tue Apr 28 13:17:33 MDT 2009
On a cluster of 62 nodes running Torque 2.1.10 and Maui 3.2.6p19,
two nodes stopped accepting jobs overnight.
Partial pestat output:

  node40  free  0.00    7879  4  16069    231  0/0   0
  node41  free  0.00    8067  4  16257    228  0/0   0
  node42  free  0.00*  56481  8  58465    269  0/0  88
  node43  excl  8.22   64561  8  66545  22975  1/1   8  156354 mikaels
  node44  free  0.11*  64561  8  66545    267  0/0  64
  node45  excl  8.07   64561  8  66545  21408  1/1   8  156060 NONE* 156227

There are jobs in the queue, and they get dispatched to other nodes,
but not to node42 or node44.
node40 and node41 are not eligible for the queue being run, so it is
fine that they have no jobs.
Please note the last column for those two nodes, which is the "tasks"
parameter: it is non-zero even though no jobs are running there.
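
To make the symptom easier to spot across all 62 nodes, here is a minimal sketch that scans pestat-style lines for nodes in the "free" state whose eighth field is "0/0" but whose ninth field (tasks) is non-zero. The field positions are an assumption read off the excerpt above, and the function name stale_task_nodes is hypothetical, not part of pestat or Torque.

```python
# Hypothetical helper: flag nodes that look idle but still report tasks.
# Assumed field layout (taken from the excerpt, not from pestat docs):
#   0=node 1=state 2=load 3=pmem 4=ncpu 5=mem 6=resi 7="n/n" 8=tasks ...

PESTAT_EXCERPT = """\
node40 free 0.00 7879 4 16069 231 0/0 0
node41 free 0.00 8067 4 16257 228 0/0 0
node42 free 0.00* 56481 8 58465 269 0/0 88
node43 excl 8.22 64561 8 66545 22975 1/1 8 156354 mikaels
node44 free 0.11* 64561 8 66545 267 0/0 64
"""


def stale_task_nodes(text):
    """Return (node, tasks) pairs for free nodes with a non-zero task count."""
    flagged = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank, wrapped, or otherwise short lines.
        if len(fields) < 9 or not fields[8].isdigit():
            continue
        node, state, sessions, tasks = fields[0], fields[1], fields[7], int(fields[8])
        if state == "free" and sessions == "0/0" and tasks > 0:
            flagged.append((node, tasks))
    return flagged


if __name__ == "__main__":
    for node, tasks in stale_task_nodes(PESTAT_EXCERPT):
        print(f"{node}: free with no jobs, but tasks={tasks}")
```

On the excerpt this flags only node42 and node44, the two nodes that are not receiving jobs.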
I have restarted pbs_mom on both nodes, and also run "momctl -C" and
"momctl -c all" against them.
There is nothing in the mom_priv directory associated with any job.
Thanks
Tony Schreiner