[torqueusers] pbs_scheduler keeps dying

gianfranco sciacca gs at hep.ucl.ac.uk
Mon Oct 24 04:07:00 MDT 2005


I've kept quiet because the scheduler has not crashed again since I
started running it under valgrind. But I'll respond to your latest
queries about possible server latency.

On Wed, 2005-10-19 at 19:33, Garrick Staples wrote:
> On Wed, Oct 19, 2005 at 05:57:51PM +0100, gianfranco sciacca alleged:
> > Ronny wrote:
> > > I had the same problem - the scheduler was dying within its operation.
> > > Try starting pbs_sched with option -a <TIMEOUT>.
> > > This will increase the alarm time for one scheduler run to <TIMEOUT>
> > > seconds.  I use -a 600.
> >
> > I'll roll this in if it dies again and will let you know.
> 
> This implies that pbs_server is being slow and unresponsive.  While we
> don't want to ignore a crash bug in pbs_sched, it is worth looking at
> latency problems in pbs_server.
> 
> Do you find that 'qstat' is slow to respond?  If 'qstat' is slow, but
> 'pbsnodes -a' is fast, then be sure poll_jobs is True in qmgr.

Both respond fast.

> If 'qstat' is fast, but 'qstat jobid1 jobid2 jobid3 jobid4' (substitute
> real jobids) is slow, then what arch and OS are you running?  I might
> have a patch for you.

'qstat jobid1 jobid2 jobid3 jobid4' is also fast.
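
For anyone repeating the comparison above, here is a rough sketch of
how the commands could be timed side by side rather than judged by eye.
The helper names (time_command, time_commands) are made up for this
example, and it assumes the client commands are on the PATH of the
machine where it runs.

```python
import subprocess
import time


def time_command(argv):
    """Run argv once, discard its output, return (exit_code, elapsed_seconds)."""
    start = time.perf_counter()
    proc = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return proc.returncode, time.perf_counter() - start


def time_commands(commands):
    """Time each command in turn; returns a list of (command, exit_code, seconds)."""
    return [(" ".join(argv), *time_command(argv)) for argv in commands]


# Example call, using the commands mentioned in this thread
# (replace jobid1..jobid4 with real job ids):
#
#   for cmd, code, secs in time_commands([
#       ["qstat"],
#       ["pbsnodes", "-a"],
#       ["qstat", "jobid1", "jobid2", "jobid3", "jobid4"],
#   ]):
#       print(f"{cmd}: {secs:.2f}s (exit {code})")
```

A single run only gives a snapshot, so it is probably worth repeating
the timing while the scheduler is in the middle of an iteration.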

> strace your pbs_server process.  Is it scrolling system calls really
> really fast?  strace it with -r, is it spending more than a second in
> any system calls?

strace also shows a fast response, so I suppose that server latency is
not the problem. Then again, the scheduler is working properly right now.
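
In case it helps someone doing the same check, below is a minimal
sketch for scanning saved 'strace -r' output for gaps longer than a
second instead of watching it scroll. It assumes the usual layout in
which each line begins with the relative timestamp followed by the
system call name; the function name slow_gaps and the regular
expression are only illustrative. Since -r reports the time between
the starts of successive calls, a large value is attributed here to
the call on the previous line.

```python
import re

# Assumed line layout: optional "[pid N]" prefix, relative timestamp,
# then the system call name and its opening parenthesis.
LINE_RE = re.compile(r"^\s*(?:\[pid\s+\d+\]\s+)?(\d+\.\d+)\s+(\w+)\(")


def slow_gaps(lines, threshold=1.0):
    """Return (gap_seconds, previous_syscall) pairs where the gap exceeds threshold."""
    hits = []
    prev = None
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip signal, "resumed" and exit lines
        gap, name = float(m.group(1)), m.group(2)
        # The gap printed on this line elapsed since the previous call began.
        if prev is not None and gap > threshold:
            hits.append((gap, prev))
        prev = name
    return hits


# Example use on a saved trace file (the file name is hypothetical):
#
#   with open("pbs_server.strace") as fh:
#       for gap, name in slow_gaps(fh):
#           print(f"{gap:.3f}s after entering {name}")
```

The gap also includes any time spent in user space between the two
calls, so strace's -T option, which prints the time spent inside each
call, may be a useful cross-check.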

Just for info, here are the server settings in qmgr:
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 60
set server node_ping_rate = 300
set server node_check_rate = 600
set server tcp_timeout = 6
set server node_pack = False
set server job_stat_rate = 30

cheers and thanks for the help,
gianfranco



