[torqueusers] node(s) not accepting jobs

rishi pathak mailmaverick666 at gmail.com
Wed Apr 29 13:21:04 MDT 2009


What is the CPU load on those nodes? Are any node health check scripts
running? If so, what is their output?
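
To screen a larger pestat listing for nodes in the same condition, something
like the following might help. It is only a rough sketch, not a TORQUE or
pestat tool: the function name suspicious_nodes is made up here, and the
column positions (state in column 2, load in column 3, the "0/0" field in
column 8, tasks in column 9) are assumed from the output quoted below, so
they may need adjusting for other pestat versions.

```python
# Sketch: flag nodes that pestat shows as "free" and idle ("0/..." in the
# eighth column) but that still report a non-zero task count.
# Column positions are an assumption based on the output quoted in this thread.

def suspicious_nodes(pestat_text):
    """Return a list of (node, load, tasks) for free nodes with leftover tasks."""
    flagged = []
    for line in pestat_text.splitlines():
        fields = line.split()
        # Skip headers and wrapped continuation lines (user names, job ids).
        if len(fields) < 9 or not fields[8].isdigit():
            continue
        node, state, load = fields[0], fields[1], fields[2]
        idle_marker, tasks = fields[7], int(fields[8])
        if state == "free" and idle_marker.startswith("0/") and tasks > 0:
            # pestat marks unusual load values with a trailing '*'.
            flagged.append((node, float(load.rstrip("*")), tasks))
    return flagged


# The rows quoted in this thread, including the wrapped continuation lines.
sample = """\
  node40  free  0.00    7879   4  16069    231  0/0    0
  node41  free  0.00    8067   4  16257    228  0/0    0
  node42  free  0.00*  56481   8  58465    269  0/0   88
  node43  excl  8.22   64561   8  66545  22975  1/1    8    156354
mikaels
  node44  free  0.11*  64561   8  66545    267  0/0   64
  node45  excl  8.07   64561   8  66545  21408  1/1    8    156060
NONE* 156227
"""

for node, load, tasks in suspicious_nodes(sample):
    print(f"{node}: load={load} tasks={tasks}")
```

On the quoted sample this picks out node42 and node44 only; node40 and node41
are skipped because their task count is zero.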

On Wed, Apr 29, 2009 at 12:58 AM, Tony Schreiner <schreian at bc.edu> wrote:

>
> On Apr 28, 2009, at 3:17 PM, Tony Schreiner wrote:
>
> > On a cluster of 62 nodes, with torque 2.1.10 and maui 3.2.6p19
> >
> > overnight 2 nodes have stopped accepting jobs
> >
> > partial pestat output
> >
> >   node40  free  0.00    7879   4  16069    231  0/0    0
> >   node41  free  0.00    8067   4  16257    228  0/0    0
> >   node42  free  0.00*  56481   8  58465    269  0/0   88
> >   node43  excl  8.22   64561   8  66545  22975  1/1    8    156354 mikaels
> >   node44  free  0.11*  64561   8  66545    267  0/0   64
> >   node45  excl  8.07   64561   8  66545  21408  1/1    8    156060 NONE* 156227
> >
> > There are jobs in the queue, and they get submitted to other nodes but
> > not to node42 and node44.
> > node40 and node41 are not eligible for the queue being run, so it's OK
> > that they have no jobs.
> >
> > Please note the last column on those 2 nodes, which is the "tasks"
> > parameter and is non-zero.
> >
> > I have restarted pbs_mom on the nodes, also done  momctl -C and momctl
> > -c all on those nodes.
> > There is nothing in the mom_priv directory associated with any job.
> >
>
>
> If I may add one more thing.
> An attempt to force a job to run on the node with qrun -H node42 JOBID
> gives the following error:
> qrun: Resource temporarily unavailable REJHOST=node42 MSG=cannot
> allocate node 'node42' to job - node not currently available (nps
> needed/free: 1/0,  joblist: l.bc.edu 2.6.27.21-170.2.56.fc10.x86_64
> #1 ....
>



-- 
Regards--
Rishi Pathak
Pune-Maharastra