[torqueusers] Torque/Maui Crashing or Pausing

Lennart Karlsson Lennart.Karlsson at nsc.liu.se
Thu Apr 20 02:08:54 MDT 2006


Garrick Staples wrote:
> On Wed, Apr 19, 2006 at 09:34:59AM -0700, Austin Godber alleged:
> > Within the last few weeks I have been seeing my AMD cluster (using
> > torque and maui) pause and then eventually restart.  Looking more
> > closely, it appears that torque briefly dies (qstat can't talk to the
> > server) and then starts running again.  This seems to cause maui to hang
> > for 15-30 minutes (showq and pals don't work).  Then miraculously maui
> > starts scheduling again and all is well.
> > 
> > For what it's worth, I think I can force it to happen by doing an
> > interactive qsub like this:
> > 	qsub -I -v DISPLAY=desktop.host.com:0.0 -q x86_64
> > although I am not certain, as testing it is fairly disruptive.  But it
> > definitely happens under other circumstances.
> > 
> > 
> > I have attached torque and maui logs.  Maui is maui-3.2.6p13 and torque
> > is torque-2.0.0p0.  I did not disable rpp.
> 
> Things like this mostly happen with slow responses from MOMs.  The
> situation is much improved in later versions of torque.  Be sure you
> have poll_jobs enabled, and try to reproduce with the current versions
> of torque and maui (don't forget to build new maui after new torque is
> installed.)


Very slow responses from the MOMs (or from pbs_server) cause timeouts in
Maui's communication with them. The quick workaround is to set a higher
timeout value in Maui, for example

RMCFG[base]             TIMEOUT=90

(assuming your resource manager is named 'base' in your RMCFG configuration).
Garrick's suggestion is the better fix, though, as soon as you are able to
do the upgrade.
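
For reference, here is a minimal sketch of where that line would sit in
maui.cfg. The TYPE=PBS line and the comments are my assumptions about a
typical Torque setup, not something taken from Austin's configuration;
replace 'base' with whatever name your existing RMCFG lines use:

```
# maui.cfg (fragment; illustrative only)

# Resource manager definition. 'base' is just a label, but it must be
# the same in every RMCFG line. TYPE=PBS is assumed for a Torque setup.
RMCFG[base]             TYPE=PBS

# Give pbs_server and the MOMs longer to answer before Maui gives up.
# 90 is the value suggested above; as far as I know the unit is seconds.
RMCFG[base]             TIMEOUT=90
```

As far as I know Maui reads maui.cfg at startup, so it will probably need
a restart before the new timeout takes effect.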

-- Lennart Karlsson <Lennart.Karlsson at nsc.liu.se>
   National Supercomputer Centre in Linkoping, Sweden
   http://www.nsc.liu.se



