[torqueusers] Torque/Maui Crashing or Pausing
Lennart.Karlsson at nsc.liu.se
Thu Apr 20 02:08:54 MDT 2006
Garrick Staples wrote:
> On Wed, Apr 19, 2006 at 09:34:59AM -0700, Austin Godber alleged:
> > Within the last few weeks I have been seeing my AMD cluster (using
> > torque and maui) pause and then eventually restart. Looking more
> > closely it appears that torque briefly dies (qstat can't talk to the
> > server) then starts running again. This seems to cause maui to hang for
> > 15-30 minutes (showq and pals don't work). Then miraculously maui
> > starts scheduling again and all is well.
> > For what it's worth, I think I can force it to happen by doing an
> > interactive qsub like this:
> > qsub -I -v DISPLAY=desktop.host.com:0.0 -q x86_64
> > although I am not certain as testing it is fairly disruptive. But it
> > definitely happens under other circumstances.
> > I have attached torque and maui logs. Maui is maui-3.2.6p13 and torque
> > is torque-2.0.0p0. I did not disable rpp.
> Things like this mostly happen with slow responses from MOMs. The
> situation is much improved in later versions of torque. Be sure you
> have poll_jobs enabled, and try to reproduce with the current versions
> of torque and maui (don't forget to build new maui after new torque is
Very slow responses from MOMs (or pbs_server) cause timeouts in the
communication. The quick solution to the problem is to set a higher
timeout value in Maui, like
(if you are using the 'base' name in your RMCFG configuration), but Garrick's
solution is much better as soon as you can do the upgrade.
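
As a rough sketch of what such a setting can look like in maui.cfg: this
assumes the TIMEOUT attribute of RMCFG (in seconds) and a resource manager
named 'base'. The value 90 is only an illustrative placeholder, not a figure
from this thread, so choose one that matches how slowly your MOMs respond.

```
# maui.cfg (sketch; the timeout value is an assumed placeholder)
RMCFG[base] TIMEOUT=90
```

Maui will most likely need a restart to pick up the change.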
-- Lennart Karlsson <Lennart.Karlsson at nsc.liu.se>
National Supercomputer Centre in Linkoping, Sweden