[torqueusers] pbs_scheduler keeps dying

Garrick Staples garrick at usc.edu
Tue Nov 8 13:56:44 MST 2005


On Tue, Nov 08, 2005 at 12:08:25PM +0000, gianfranco sciacca alleged:
> On Tue, 2005-10-18 at 19:13, Garrick Staples wrote:
> > On Tue, Oct 18, 2005 at 03:58:35PM +0100, gianfranco sciacca alleged:
> > > We have been running torque with its stock scheduler for about 7 months 
> > > with little problems. All of a sudden, since a few days, the scheduler 
> > > keeps dying which is seriously disrupting our cluster operation. I should 
> > 
> > Can you get a gdb backtrace of it dying?  Or maybe run it under
> > valgrind?
> 
> This is the first crash under valgrind. I append the log below.
[..snip..]
> ==2411== To see them, rerun with: --show-reachable=yes
> ==2413== Invalid read of size 4
> ==2413==    at 0x805A6FB: pbs_rescquery (pbsD_resc.c:207)
> ==2413==    by 0x8053B0C: check_nodes (check.c:484)

All of the pbs_sched segfaults point to a communication problem with
pbs_server where an error has occurred but isn't reported correctly.  The
client reads and acts on an invalid reply.
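
To illustrate the failure mode (not the actual TORQUE code), here is a
minimal sketch of a client that validates a reply before reading from
it.  The struct layout, the enum, and the name read_resc_reply are all
hypothetical and made up for this example; the real reply structure and
the code in pbsD_resc.c look different.

```c
#include <stddef.h>

/* Hypothetical reply kinds; NOT the real TORQUE reply layout. */
enum reply_kind { REPLY_ERROR, REPLY_RESC_QUERY };

struct reply {
    enum reply_kind kind;
    int code;          /* 0 on success, nonzero error code otherwise */
    int num_resc;      /* meaningful only for REPLY_RESC_QUERY */
    const int *avail;  /* meaningful only for REPLY_RESC_QUERY */
};

/*
 * Copy at most max values from the reply into out.
 * Returns the number of values copied, or -1 if the reply is an
 * error reply or is otherwise unusable.  Checking kind and code
 * first is the step whose absence leads to an invalid read.
 */
int read_resc_reply(const struct reply *r, int *out, int max)
{
    if (r == NULL || out == NULL || max < 0)
        return -1;

    /* Reject error replies before touching the payload fields. */
    if (r->kind != REPLY_RESC_QUERY || r->code != 0)
        return -1;

    /* Reject a payload that claims data but carries none. */
    if (r->num_resc < 0 || (r->num_resc > 0 && r->avail == NULL))
        return -1;

    int n = r->num_resc < max ? r->num_resc : max;
    for (int i = 0; i < n; i++)
        out[i] = r->avail[i];

    return n;
}
```

Without the kind/code check, an error reply would be treated as if it
carried resource data, and the loop would read through a pointer that
was never set up, which is the sort of thing valgrind flags as an
invalid read.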

Dave found and fixed a bug that fits this profile for 2.0.0p0.  Looking
back through this thread I don't see which version you are using
(appears to be before 1.2.0p6), but you should be able to build and
install _only_ pbs_sched for this particular problem.

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California