[torqueusers] node(s) not accepting jobs
Tony Schreiner
schreian at bc.edu
Tue Apr 28 13:28:55 MDT 2009
On Apr 28, 2009, at 3:17 PM, Tony Schreiner wrote:
> On a cluster of 62 nodes, running torque 2.1.10 and maui 3.2.6p19,
> two nodes stopped accepting jobs overnight.
>
> Partial pestat output:
>
> node40 free 0.00 7879 4 16069 231 0/0 0
> node41 free 0.00 8067 4 16257 228 0/0 0
> node42 free 0.00* 56481 8 58465 269 0/0 88
> node43 excl 8.22 64561 8 66545 22975 1/1 8 156354 mikaels
> node44 free 0.11* 64561 8 66545 267 0/0 64
> node45 excl 8.07 64561 8 66545 21408 1/1 8 156060 NONE* 156227
>
> There are jobs in the queue, and they get dispatched to other nodes
> but not to node42 and node44.
> node40 and node41 are not eligible for the queue being run, so it's OK
> that they have no jobs.
>
> Please note the last column on those two nodes, which is the "tasks"
> parameter and is non-zero.
>
> I have restarted pbs_mom on the nodes, and have also run momctl -C
> and momctl -c all on those nodes.
> There is nothing in the mom_priv directory associated with any job.
>
If I may add one more thing:
an attempt to force a job to run on the node with qrun -H node42 JOBID
gives the following error:
qrun: Resource temporarily unavailable REJHOST=node42 MSG=cannot
allocate node 'node42' to job - node not currently available (nps
needed/free: 1/0, joblist: l.bc.edu 2.6.27.21-170.2.56.fc10.x86_64
#1 ....
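
For what it's worth, the symptom in the pestat listing (a node in state
free, 0/0 in the second-to-last column, but a non-zero "tasks" value) can be
picked out mechanically. What follows is only a rough Python sketch and is
not part of pestat or torque. The column order is inferred from the sample
lines alone, the function name stuck_nodes is made up for illustration, and
the sample text has its wrapped lines rejoined.

```python
# Sketch: flag nodes that pestat shows as free and idle but with leftover tasks.
# Assumed column order (inferred from the sample, not from pestat documentation):
#   node state load pmem ncpu mem resi <n/n> tasks [jobids/users ...]

SAMPLE = """\
node40 free 0.00 7879 4 16069 231 0/0 0
node41 free 0.00 8067 4 16257 228 0/0 0
node42 free 0.00* 56481 8 58465 269 0/0 88
node43 excl 8.22 64561 8 66545 22975 1/1 8 156354 mikaels
node44 free 0.11* 64561 8 66545 267 0/0 64
node45 excl 8.07 64561 8 66545 21408 1/1 8 156060 NONE* 156227
"""


def stuck_nodes(pestat_text):
    """Return (node, tasks) pairs for free nodes showing 0/0 but tasks > 0."""
    stuck = []
    for line in pestat_text.splitlines():
        fields = line.split()
        if len(fields) < 9:
            continue  # blank line, header, or wrapped continuation line
        node, state, jobs, tasks = fields[0], fields[1], fields[7], fields[8]
        if not tasks.isdigit():
            continue  # not a data row in the assumed layout
        if state == "free" and jobs == "0/0" and int(tasks) > 0:
            stuck.append((node, int(tasks)))
    return stuck


if __name__ == "__main__":
    for node, tasks in stuck_nodes(SAMPLE):
        print(f"{node}: free, no jobs, but tasks={tasks}")
```

On the sample above this picks out node42 and node44 and skips node40 and
node41, whose tasks value is 0.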