[torqueusers] node(s) not accepting jobs

Tony Schreiner schreian at bc.edu
Thu Apr 30 07:55:28 MDT 2009


Hi

The CPU load was near 0, and there are no health check scripts running  
that I know of. ps didn't show any abnormal processes.

In the end, putting the 2 nodes offline, shutting them down,  
restarting pbs_server, and restarting the nodes fixed it.
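
For anyone searching the archive with the same symptom: the pattern in the
quoted pestat output below (state "free" but a non-zero "tasks" column) can
be picked out with a small script. This is only a minimal sketch. It assumes
whitespace-separated pestat rows in which the second field is the node state
and the ninth field is the task count, which matches the rows quoted in this
thread but may differ for other pestat versions. The function name
find_stuck_nodes is made up for illustration.

```python
def find_stuck_nodes(pestat_text):
    """Return nodes that report state 'free' yet show a non-zero task count."""
    stuck = []
    for line in pestat_text.splitlines():
        fields = line.split()
        # Assumed layout (from the rows quoted in this thread):
        # node state load pmem ncpu mem resi <slots> tasks [jobid user ...]
        if len(fields) < 9:
            continue  # header, blank line, or a wrapped continuation line
        node, state, tasks = fields[0], fields[1], fields[8]
        if not tasks.isdigit():
            continue  # not a data row in the assumed layout
        if state == "free" and int(tasks) > 0:
            stuck.append(node)
    return stuck


# Rows taken from the pestat output quoted in this thread.
SAMPLE = """\
  node40  free  0.00    7879   4  16069    231  0/0    0
  node41  free  0.00    8067   4  16257    228  0/0    0
  node42  free  0.00*  56481   8  58465    269  0/0   88
  node43  excl  8.22   64561   8  66545  22975  1/1    8    156354 mikaels
  node44  free  0.11*  64561   8  66545    267  0/0   64
  node45  excl  8.07   64561   8  66545  21408  1/1    8    156060 NONE* 156227
"""

print(find_stuck_nodes(SAMPLE))  # ['node42', 'node44']
```

On a live system the text could come from piping pestat into the script
instead of the hard-coded SAMPLE. The check only flags candidates; it does
not explain why the server still counts tasks against the node.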

Cheers,
Tony

On Apr 29, 2009, at 3:21 PM, rishi pathak wrote:

> What is the CPU load on those nodes? Are any node health check scripts  
> running? If so, what is their output?
>
> On Wed, Apr 29, 2009 at 12:58 AM, Tony Schreiner <schreian at bc.edu>  
> wrote:
>
> On Apr 28, 2009, at 3:17 PM, Tony Schreiner wrote:
>
> > On a cluster of 62 nodes, with torque 2.1.10 and maui 3.2.6p19
> >
> > overnight 2 nodes have stopped accepting jobs
> >
> > partial pestat output
> >
> >   node40  free  0.00    7879   4  16069    231  0/0    0
> >   node41  free  0.00    8067   4  16257    228  0/0    0
> >   node42  free  0.00*  56481   8  58465    269  0/0   88
> >   node43  excl  8.22   64561   8  66545  22975  1/1    8    156354 mikaels
> >   node44  free  0.11*  64561   8  66545    267  0/0   64
> >   node45  excl  8.07   64561   8  66545  21408  1/1    8    156060 NONE* 156227
> >
> > there are jobs in the queue and they get submitted to other nodes but not
> > to node42 and node44.
> > node40 and node41 are not eligible for the queue being run so it's ok
> > that they have no jobs.
> >
> > Please note the last column on those 2 nodes which is the "tasks"
> > parameter and is non-zero
> >
> > I have restarted pbs_mom on the nodes, also done momctl -C and momctl
> > -c all on those nodes.
> > There is nothing in the mom_priv directory associated with any job.
> >
>
>
> If I may add one more thing.
> An attempt to force a job to run on the node with qrun -H node42 JOBID
>
> gives the following error
> qrun: Resource temporarily unavailable REJHOST=node42 MSG=cannot
> allocate node 'node42' to job - node not currently available (nps
> needed/free: 1/0,  joblist: l.bc.edu 2.6.27.21-170.2.56.fc10.x86_64
> #1 ....
>
> -- 
> Regards--
> Rishi Pathak
> Pune-Maharastra
