[torqueusers] pbs_scheduler keeps dying
garrick at usc.edu
Wed Oct 19 12:33:10 MDT 2005
On Wed, Oct 19, 2005 at 05:57:51PM +0100, gianfranco sciacca alleged:
> Ronny wrote:
> > had the same problem - the scheduler was dying within its operation.
> > Try starting pbs_sched with option -a <TIMEOUT>.
> > This will increase the alarm time for one scheduler run to <TIMEOUT>
> > seconds. I use -a 600.
> I'll roll this in if it dies again and will let you know.
This implies that pbs_server is being slow and unresponsive. While we
don't want to ignore a crash bug in pbs_sched, it is worth looking at
latency problems in pbs_server.
Do you find that 'qstat' is slow to respond? If 'qstat' is slow, but
'pbsnodes -a' is fast, then be sure poll_jobs is True in qmgr.
If 'qstat' is fast, but 'qstat jobid1 jobid2 jobid3 jobid4' (substitute
real jobids) is slow, then what arch and OS are you running? I might
have a patch for you.
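
To make these comparisons less subjective, you could time each command instead of eyeballing it. This is only a rough sketch: the `elapsed` helper is a hypothetical name of my own, not part of TORQUE, and it only resolves whole seconds, which should be enough to tell "instant" from "slow".

```shell
# Hypothetical helper: run a command, discard its output,
# and print how many whole seconds it took.
elapsed() {
    start=$(date +%s)
    "$@" > /dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Only attempt the comparison where the TORQUE client commands are installed.
if command -v qstat > /dev/null 2>&1; then
    echo "qstat:       $(elapsed qstat)s"
    echo "pbsnodes -a: $(elapsed pbsnodes -a)s"
fi

# For the multi-job case, substitute real job IDs for the placeholders:
#   elapsed qstat jobid1 jobid2 jobid3 jobid4
#
# If qstat is slow but pbsnodes -a is fast, enable job polling:
#   qmgr -c 'set server poll_jobs = True'
```

The qmgr line is left as a comment on purpose, so that nothing changes the server configuration until you have looked at the timings.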
Run strace on your pbs_server process. Is it scrolling through system
calls really, really fast? Run it again with -r: is it spending more
than a second in any system call?
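
If the trace scrolls too fast to read, a small filter can pick out the slow spots. This is a sketch under a few assumptions: `slow_calls` is a hypothetical helper name, and it expects a single traced process without -f, so that each line starts with the relative timestamp. With -r, that timestamp is the time since the previous system call started, so a large value points at the call on the line before it.

```shell
# Hypothetical filter for `strace -r` output read from stdin.
# Prints every call that was followed by a gap longer than LIMIT seconds
# (default 1). The gap also includes any time spent outside system calls.
slow_calls() {
    awk -v limit="${1:-1}" '
        # The current delta belongs to the previous line, so report that one.
        $1 + 0 > limit && prev != "" {
            printf "%s  (%.6fs before next call)\n", prev, $1
        }
        # Remember this line without its leading timestamp.
        { $1 = ""; sub(/^ +/, ""); prev = $0 }
    '
}

# Example usage (run as root or as the pbs_server owner, Ctrl-C to stop).
# strace writes to stderr, hence the 2>&1. pgrep is assumed to be available
# and to find exactly one pbs_server process.
#   strace -r -p "$(pgrep -x pbs_server)" 2>&1 | slow_calls 1
```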
Garrick Staples, Linux/HPCC Administrator
University of Southern California