[torqueusers] pbs_scheduler keeps dying

Garrick Staples garrick at usc.edu
Wed Oct 19 12:33:10 MDT 2005


On Wed, Oct 19, 2005 at 05:57:51PM +0100, gianfranco sciacca alleged:
> Ronny wrote:
> > I had the same problem - the scheduler was dying within its operation.
> > Try starting pbs_sched with option -a <TIMEOUT>.
> > This will increase the alarm time for one scheduler run to <TIMEOUT>
> > seconds.  I use -a 600.
>
> I'll roll this in if it dies again and will let you know.

This implies that pbs_server is being slow and unresponsive.  While we
don't want to ignore a crash bug in pbs_sched, it is worth looking at
latency problems in pbs_server.

Do you find that 'qstat' is slow to respond?  If 'qstat' is slow, but
'pbsnodes -a' is fast, then be sure poll_jobs is True in qmgr.

If 'qstat' is fast, but 'qstat jobid1 jobid2 jobid3 jobid4' (substitute
real jobids) is slow, then what arch and OS are you running?  I might
have a patch for you.
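
A rough way to run these comparisons, as a sketch rather than anything
official: elapsed_secs is a helper name made up for this mail, and JOBIDS
is a placeholder you would fill in with a few real job ids.  It only
reports whole seconds, which is enough to tell "fast" from "slow".

```shell
# Print how many whole seconds a command takes; its output is discarded.
elapsed_secs() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1 || true
    end=$(date +%s)
    echo $((end - start))
}

# JOBIDS is a placeholder: set it to a few real job ids before running.
JOBIDS=${JOBIDS:-}

# Only run the comparison on a host that has the TORQUE client tools.
if command -v qstat >/dev/null 2>&1; then
    echo "qstat:        $(elapsed_secs qstat)s"
    echo "pbsnodes -a:  $(elapsed_secs pbsnodes -a)s"
    if [ -n "$JOBIDS" ]; then
        # Left unquoted on purpose so each id becomes its own argument.
        echo "qstat <ids>:  $(elapsed_secs qstat $JOBIDS)s"
    fi
    # Lists poll_jobs only if it has been set explicitly.
    qmgr -c 'print server' | grep poll_jobs || echo "poll_jobs not listed"
    # Arch and OS, for the report back to the list.
    uname -srm
fi
```

If poll_jobs needs changing, something like
qmgr -c 'set server poll_jobs = True' should do it.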

Run strace on your pbs_server process.  Are system calls scrolling by
very fast?  Run it again with -r: does more than a second pass between
any two system calls?
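
To avoid watching the output by eye, a small filter along these lines
might help.  It is only a sketch: slow_gaps is a made-up helper name, and
it assumes the usual 'strace -r' layout where each line begins with the
time since the previous system call started.  A large value therefore
points at the call on the line before it, which is what gets printed.

```shell
# Read `strace -r` output on stdin and report gaps longer than one second.
slow_gaps() {
    awk '$1 + 0 > 1.0 { print "gap " $1 "s after: " prev }
         { prev = $0; sub(/^ *[0-9.]+ +/, "", prev) }'
}

# Attach only if a pbs_server process is actually running on this host.
pid=$(pgrep -o -x pbs_server 2>/dev/null || true)
if [ -n "$pid" ]; then
    # Usually needs root; strace writes to stderr; stop with Ctrl-C.
    strace -r -p "$pid" 2>&1 | slow_gaps
fi
```

Note that a gap also includes time pbs_server spent between calls, so
treat a hit as a pointer for a closer look, not as proof.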

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California