[torqueusers] pbs_scheduler keeps dying
gs at hep.ucl.ac.uk
Mon Oct 24 04:07:00 MDT 2005
I've kept quiet because the scheduler has not crashed since I started
running it under valgrind. But I'll respond to your latest queries about
possible server latency problems.
On Wed, 2005-10-19 at 19:33, Garrick Staples wrote:
> On Wed, Oct 19, 2005 at 05:57:51PM +0100, gianfranco sciacca alleged:
> > Ronny wrote:
> > > had the same problem - the scheduler was dying within its operation.
> > > Try starting pbs_sched with option -a <TIMEOUT>.
> > > This will increase the alarm time for one scheduler run to <TIMEOUT>
> > > seconds. I use -a 600.
> > I'll roll this in if it dies again and will let you know.
> This implies that pbs_server is being slow and unresponsive. While we
> don't want to ignore a crash bug in pbs_sched, it is worth looking at
> latency problems in pbs_server.
> Do you find that 'qstat' is slow to respond? If 'qstat' is slow, but
> 'pbsnodes -a' is fast, then be sure poll_jobs is True in qmgr.
Both respond quickly.
> If 'qstat' is fast, but 'qstat jobid1 jobid2 jobid3 jobid4' (substitute
> real jobids) is slow, then what arch and OS are you running? I might
> have a patch for you.
'qstat jobid1 jobid2 jobid3 jobid4' is also fast.
> strace your pbs_server process. Is it scrolling system calls really
> really fast? strace it with -r, is it spending more than a second in
> any system calls?
strace also shows a fast response, so I suppose that this is not the
problem. Then again, right now it's working properly.
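For anyone repeating the 'strace -r' check on a saved trace, here is a minimal sketch in Python that flags gaps longer than a second. It assumes the usual 'strace -r' layout, where each line starts with the time elapsed since the previous system call began, and it assumes the trace was taken without -f (no pid prefix). The helper name slow_gaps and the SAMPLE lines are invented for illustration; they are not output from this server.

```python
import re

# One line of `strace -r` output: a relative timestamp, then the call text.
LINE_RE = re.compile(r"^\s*(\d+\.\d+)\s+(.*)$")


def slow_gaps(lines, threshold=1.0):
    """Return (delta_seconds, previous_call, current_call) for each gap above threshold.

    The timestamp on a line is measured from the start of the previous call,
    so a large value points at the previous call as the likely slow one.
    """
    hits = []
    previous = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip anything that is not a timestamped call line
        delta = float(match.group(1))
        call = match.group(2)
        if previous is not None and delta > threshold:
            hits.append((delta, previous, call))
        previous = call
    return hits


# Invented sample lines, for illustration only (not from the thread).
SAMPLE = [
    "     0.000000 select(8, [5 6 7], NULL, NULL, {1, 0}) = 0 (Timeout)",
    "     1.000912 time(NULL) = 1130148420",
    "     0.000051 select(8, [5 6 7], NULL, NULL, {1, 0}) = 1 (in [6])",
    "     0.000230 accept(6, 0, NULL) = 9",
]

if __name__ == "__main__":
    for delta, prev_call, _ in slow_gaps(SAMPLE):
        print(f"{delta:.3f}s after: {prev_call}")
```

A gap that follows a select() ending in a timeout is probably just the server idling rather than a stall, so each hit still needs to be read in context.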
Just for info, here are the server settings in qmgr:
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 60
set server node_ping_rate = 300
set server node_check_rate = 600
set server tcp_timeout = 6
set server node_pack = False
set server job_stat_rate = 30
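As a rough way to check a listing like this for a given attribute (for example the poll_jobs setting mentioned earlier), the following Python sketch parses the 'set server' lines into a dictionary. The function name parse_server_settings is made up for this example, and the parser only assumes the 'set server name = value' layout shown above.

```python
def parse_server_settings(text):
    """Parse 'set server name = value' lines into a dict of strings."""
    settings = {}
    prefix = "set server "
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue  # ignore queue settings, comments and blank lines
        name, sep, value = line[len(prefix):].partition("=")
        if sep:
            settings[name.strip()] = value.strip()
    return settings


# The settings exactly as listed in this message.
SETTINGS = """\
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 60
set server node_ping_rate = 300
set server node_check_rate = 600
set server tcp_timeout = 6
set server node_pack = False
set server job_stat_rate = 30
"""

if __name__ == "__main__":
    parsed = parse_server_settings(SETTINGS)
    # Prints None when the attribute does not appear in the listing.
    print("poll_jobs:", parsed.get("poll_jobs"))
    print("scheduler_iteration:", parsed.get("scheduler_iteration"))
```

An attribute missing from the listing only means it was not printed here; the listing may be partial, so this does not by itself show what value the server is actually using.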
cheers and thanks for the help,