[torqueusers] node(s) not accepting jobs

Tony Schreiner schreian at bc.edu
Tue Apr 28 13:28:55 MDT 2009


On Apr 28, 2009, at 3:17 PM, Tony Schreiner wrote:

> On a cluster of 62 nodes running Torque 2.1.10 and Maui 3.2.6p19,
> two nodes stopped accepting jobs overnight.
>
> Partial pestat output:
>
>   node40  free  0.00    7879   4  16069    231  0/0    0
>   node41  free  0.00    8067   4  16257    228  0/0    0
>   node42  free  0.00*  56481   8  58465    269  0/0   88
>   node43  excl  8.22   64561   8  66545  22975  1/1    8    156354 mikaels
>   node44  free  0.11*  64561   8  66545    267  0/0   64
>   node45  excl  8.07   64561   8  66545  21408  1/1    8    156060 NONE* 156227
>
> There are jobs in the queue, and they get dispatched to other nodes but
> not to node42 and node44.
> node40 and node41 are not eligible for the queue being run, so it is
> expected that they have no jobs.
>
> Please note the last column on those two nodes, which is the "tasks"
> parameter: it is non-zero even though neither node is running a job.
>
> I have restarted pbs_mom on those nodes and have also run momctl -C and
> momctl -c all against them.
> There is nothing in the mom_priv directory associated with any job.
>


If I may add one more thing: an attempt to force a job to run on the
node with qrun -H node42 JOBID gives the following error:

qrun: Resource temporarily unavailable REJHOST=node42 MSG=cannot  
allocate node 'node42' to job - node not currently available (nps
needed/free: 1/0,  joblist: l.bc.edu 2.6.27.21-170.2.56.fc10.x86_64  
#1 ....
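
To spot other nodes in the same state, here is a minimal Python sketch that scans pestat output for nodes reported as "free" but with a non-zero tasks count. It assumes the column layout seen in the sample above (tasks as the ninth whitespace-separated field) and that wrapped lines have been rejoined; stuck_nodes is a hypothetical helper name, not part of pestat or Torque.

```python
def stuck_nodes(pestat_lines):
    """Return (node, tasks) pairs for nodes that are 'free' yet report tasks."""
    flagged = []
    for line in pestat_lines:
        fields = line.split()
        # Assumed layout, taken from the sample in this post:
        # node state load pmem ncpu mem resi usrs tasks [jobids/users ...]
        if len(fields) < 9:
            continue  # skip headers, blank lines, and wrapped fragments
        node, state, tasks = fields[0], fields[1], fields[8]
        if state == "free" and tasks.isdigit() and int(tasks) > 0:
            flagged.append((node, int(tasks)))
    return flagged


# The pestat lines quoted in this post, with the wrapped lines rejoined.
sample = """\
  node40  free  0.00    7879   4  16069    231  0/0    0
  node41  free  0.00    8067   4  16257    228  0/0    0
  node42  free  0.00*  56481   8  58465    269  0/0   88
  node43  excl  8.22   64561   8  66545  22975  1/1    8    156354 mikaels
  node44  free  0.11*  64561   8  66545    267  0/0   64
  node45  excl  8.07   64561   8  66545  21408  1/1    8    156060 NONE* 156227
""".splitlines()

print(stuck_nodes(sample))  # [('node42', 88), ('node44', 64)]
```

The check is deliberately narrow: a node in state "excl" with a non-zero tasks count is running a job and is not flagged.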

