[torqueusers] node(s) not accepting jobs

Tony Schreiner schreian at bc.edu
Tue Apr 28 13:17:33 MDT 2009


On a cluster of 62 nodes running Torque 2.1.10 and Maui 3.2.6p19,
two nodes stopped accepting jobs overnight.

Partial pestat output:

   node40  free  0.00    7879   4  16069    231  0/0    0
   node41  free  0.00    8067   4  16257    228  0/0    0
   node42  free  0.00*  56481   8  58465    269  0/0   88
   node43  excl  8.22   64561   8  66545  22975  1/1    8    156354 mikaels
   node44  free  0.11*  64561   8  66545    267  0/0   64
   node45  excl  8.07   64561   8  66545  21408  1/1    8    156060 NONE* 156227

There are jobs in the queue, and they get dispatched to other nodes
but not to node42 and node44.
node40 and node41 are not eligible for the queue being run, so it is
fine that they have no jobs.

Please note the last column for those two nodes, which is the "tasks"
parameter; it is non-zero even though both nodes show as free.
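
For anyone who wants to spot this condition across all nodes, here is a
rough sketch that scans pestat-style rows for nodes reported as free but
with a non-zero task count. It assumes the column order seen in the
output above (node, state, load, pmem, ncpu, mem, resi, usrs, tasks);
the function name idle_nodes_with_tasks is made up for illustration, and
other pestat versions may lay out the columns differently.

```python
# Rows copied from the pestat output in this post (wrapped lines rejoined).
PESTAT_ROWS = """\
node40  free  0.00    7879   4  16069    231  0/0    0
node41  free  0.00    8067   4  16257    228  0/0    0
node42  free  0.00*  56481   8  58465    269  0/0   88
node43  excl  8.22   64561   8  66545  22975  1/1    8    156354 mikaels
node44  free  0.11*  64561   8  66545    267  0/0   64
node45  excl  8.07   64561   8  66545  21408  1/1    8    156060 NONE* 156227
"""


def idle_nodes_with_tasks(text):
    """Return (node, tasks) pairs for nodes that are 'free' yet report tasks > 0.

    Assumption: the ninth whitespace-separated field is the task count.
    """
    flagged = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 9:
            continue  # skip blank lines or rows without a tasks column
        node, state, tasks = fields[0], fields[1], fields[8]
        if state == "free" and tasks.isdigit() and int(tasks) > 0:
            flagged.append((node, int(tasks)))
    return flagged


if __name__ == "__main__":
    for node, tasks in idle_nodes_with_tasks(PESTAT_ROWS):
        print(f"{node}: free but tasks={tasks}")
```

On the rows above this flags only node42 and node44. It does not flag
node40 and node41, which are free with zero tasks.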

I have restarted pbs_mom on the nodes, and have also run momctl -C and
momctl -c all on those nodes.
There is nothing in the mom_priv directory associated with any job.

Thanks
Tony Schreiner

