[torqueusers] pbs_scheduler keeps dying

Garrick Staples garrick at usc.edu
Wed Nov 9 15:45:51 MST 2005


On Wed, Nov 09, 2005 at 11:10:08AM +0000, gianfranco sciacca alleged:
> On Tue, 2005-11-08 at 20:56, Garrick Staples wrote:
> > On Tue, Nov 08, 2005 at 12:08:25PM +0000, gianfranco sciacca alleged:
> > > On Tue, 2005-10-18 at 19:13, Garrick Staples wrote:
> > > > On Tue, Oct 18, 2005 at 03:58:35PM +0100, gianfranco sciacca alleged:
> > > > > We have been running torque with its stock scheduler for about 7 months 
> > > > > with few problems. For the past few days, however, the scheduler has 
> > > > > kept dying, which is seriously disrupting our cluster operation. I should 
> > > > 
> > > > Can you get a gdb backtrace of it dying?  Or maybe run it under
> > > > valgrind?
> > > 
> > > This is the first crash under valgrind. I append the log below.
> > [..snip..]
> > > ==2411== To see them, rerun with: --show-reachable=yes
> > > ==2413== Invalid read of size 4
> > > ==2413==    at 0x805A6FB: pbs_rescquery (pbsD_resc.c:207)
> > > ==2413==    by 0x8053B0C: check_nodes (check.c:484)
> > 
> > All of the pbs_sched segfaults point to a communication problem with
> > pbs_server where an error has occurred but isn't reported correctly.  The
> > client reads and acts on an invalid reply.
> > 
> > Dave found and fixed a bug that fits this profile for 2.0.0p0.  Looking
> > back through this thread I don't see which version you are using
> > (appears to be before 1.2.0p6), but you should be able to build and
> > install _only_ pbs_sched for this particular problem.
> 
> Thanks, Garrick, for the response. We are running version 1.2.0p2. I could
> build and install only pbs_sched from version 2.0.0p0, but I'd
> rather upgrade the server and moms as well at the first opportunity. Is
> a direct jump between the two versions advised, or should I upgrade in steps?

It should be fine to upgrade.  We were just talking about upgrade
procedures the other day:
http://www.supercluster.org/pipermail/torqueusers/2005-November/002448.html
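
The valgrind trace quoted above shows an invalid read in pbs_rescquery,
which fits a client acting on a reply it never validated. A minimal
sketch of the kind of defensive check that avoids this, assuming a
hypothetical reply structure (struct reply, read_avail, and the REPLY_*
constants are illustrative only and are not TORQUE's actual API or the
fix that went into 2.0.0p0):

```c
#include <stddef.h>

/* Hypothetical reply type; NOT TORQUE's real batch reply structure. */
enum reply_kind { REPLY_ERROR = 0, REPLY_RESC_QUERY = 1 };

struct reply {
    enum reply_kind kind;   /* what the server actually sent back */
    int count;              /* number of valid entries in avail[] */
    const int *avail;       /* only meaningful for REPLY_RESC_QUERY */
};

/*
 * Copy entry idx of the reply into *out.
 * Returns 0 on success, -1 if the reply cannot be trusted.
 */
int read_avail(const struct reply *r, int idx, int *out)
{
    if (r == NULL || out == NULL)
        return -1;                  /* no reply at all */
    if (r->kind != REPLY_RESC_QUERY)
        return -1;                  /* server reported an error instead */
    if (r->avail == NULL || idx < 0 || idx >= r->count)
        return -1;                  /* payload missing or too short */
    *out = r->avail[idx];
    return 0;
}
```

A caller would treat a -1 return as "the query failed" and skip the
scheduling decision for that cycle instead of dereferencing the reply.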

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California