[torqueusers] pbs_scheduler keeps dying
gs at hep.ucl.ac.uk
Wed Nov 9 04:10:08 MST 2005
On Tue, 2005-11-08 at 20:56, Garrick Staples wrote:
> On Tue, Nov 08, 2005 at 12:08:25PM +0000, gianfranco sciacca alleged:
> > On Tue, 2005-10-18 at 19:13, Garrick Staples wrote:
> > > On Tue, Oct 18, 2005 at 03:58:35PM +0100, gianfranco sciacca alleged:
> > > > We have been running torque with its stock scheduler for about 7 months
> > > > with few problems. For the past few days, however, the scheduler
> > > > keeps dying, which is seriously disrupting our cluster operation. I should
> > >
> > > Can you get a gdb backtrace of it dying? Or maybe run it under
> > > valgrind?
> > This is the first crash under valgrind. I append the log below.
> > ==2411== To see them, rerun with: --show-reachable=yes
> > ==2413== Invalid read of size 4
> > ==2413== at 0x805A6FB: pbs_rescquery (pbsD_resc.c:207)
> > ==2413== by 0x8053B0C: check_nodes (check.c:484)
> All of the pbs_sched segfaults point to a communication problem with
> pbs_server where an error has occurred but isn't reported correctly. The
> client then reads and acts on an invalid reply.
> Dave found and fixed a bug that fits this profile for 2.0.0p0. Looking
> back through this thread I don't see which version you are using
> (appears to be before 1.2.0p6), but you should be able to build and
> install _only_ pbs_sched for this particular problem.
Thanks Garrick for the response. We are running version 1.2.0p2. I could
build and install only pbs_sched from version 2.0.0p0, but I'd
rather upgrade the server and moms as well at the first opportunity. Is
a direct jump between the two versions advisable, or should I upgrade
through intermediate versions?
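
For readers following the thread, the failure mode Garrick describes (a
client acting on a reply that actually represents an unreported error) can
be illustrated with a small sketch. This is not the TORQUE code and not
Dave's fix: the struct reply type, its fields, and first_value() are all
hypothetical names invented for illustration. It only shows the general
defensive pattern of validating a reply before reading its payload, which
is the kind of check whose absence would yield an invalid 4-byte read like
the one valgrind reports.

```c
#include <stddef.h>

/* Hypothetical reply type; NOT the real TORQUE batch reply structure. */
struct reply {
    int        error_code; /* 0 on success, nonzero if the request failed */
    int        count;      /* number of valid entries in values[] */
    const int *values;     /* may be NULL when the request failed */
};

/*
 * Store the first value of the reply in *out and return 0.
 * Return -1 if the reply is missing, reports an error, or has no payload,
 * so the caller never dereferences an invalid pointer.
 */
int first_value(const struct reply *r, int *out)
{
    if (r == NULL || r->error_code != 0)
        return -1; /* no reply, or the server reported an error */
    if (r->values == NULL || r->count < 1)
        return -1; /* reply claims success but carries no data */
    *out = r->values[0];
    return 0;
}
```

A caller would check the return value and skip or retry the query on -1
rather than use the contents of *out.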