[torqueusers] Torque/Maui Crashing or Pausing

Garrick Staples garrick at usc.edu
Wed Apr 19 11:26:33 MDT 2006


On Wed, Apr 19, 2006 at 09:34:59AM -0700, Austin Godber alleged:
> Within the last few weeks I have been seeing my AMD cluster (using
> torque and maui) pause and then eventually restart.  Looking more
> closely it appears that torque briefly dies (qstat can't talk to the
> server) then starts running again.  This seems to cause maui to hang for
> 15-30 minutes (showq and pals don't work).  Then miraculously maui
> starts scheduling again and all is well.
> 
> For what it's worth, I think I can force it to happen by doing an
> interactive qsub like this:
> 	qsub -I -v DISPLAY=desktop.host.com:0.0 -q x86_64
> although I am not certain as testing it is fairly disruptive.  But it
> definitely happens under other circumstances.
> 
> 
> I have attached torque and maui logs.  Maui is maui-3.2.6p13 and torque
> is torque-2.0.0p0.  I did not disable rpp.

Things like this mostly happen with slow responses from MOMs.  The
situation is much improved in later versions of torque.  Be sure you
have poll_jobs enabled, and try to reproduce with the current versions
of torque and maui (don't forget to rebuild maui after the new torque is
installed).
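
For reference, a minimal sketch of checking and enabling poll_jobs with
qmgr. It assumes the commands are run on the pbs_server host by a user with
TORQUE manager privileges, and that the installed torque version supports
the poll_jobs server attribute (not confirmed for torque-2.0.0p0):

```shell
# Show the current server settings and look for poll_jobs
# (print server only lists attributes that have been set explicitly).
qmgr -c 'print server' | grep poll_jobs

# Enable polling of job information from the MOMs.
qmgr -c 'set server poll_jobs = true'
```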


> Thanks for any help you can provide.
> 
> Austin
> 
> Torque Error
> ============
> 
> 04/13/2006 13:06:19;0001;PBS_Server;Svr;PBS_Server;Invalid argument (22) in wait_request, select failed
> 04/13/2006 13:06:19;0001;PBS_Server;Svr;PBS_Server;PBS_Server, wait_request failed
> 04/13/2006 13:06:19;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
> 04/13/2006 13:06:19;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
> 04/13/2006 13:06:19;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
> 04/13/2006 13:06:19;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
> 04/13/2006 13:06:19;0002;PBS_Server;Req;dis_reply_write;DIS reply failure, -1
> 04/13/2006 13:21:20;0001;PBS_Server;Svr;PBS_Server;Success (0) in wait_request, timeout connection from 149.169.147.253 (31 seconds)
> 04/13/2006 13:21:20;0001;PBS_Server;Svr;PBS_Server;Success (0) in wait_request, timeout connection from 149.169.147.253 (31 seconds)
> 
> Maui Error
> ==========
> 
> 04/13 13:05:42 MPBSClusterQuery(COLOSSUS.MARS.ASU.EDU,RCount,SC)
> 04/13 13:05:51 ERROR:    cannot get node info: Premature end of message
> 04/13 13:21:20 WARNING:  no resources detected
> 04/13 13:21:20 MPBSWorkloadQuery(COLOSSUS.MARS.ASU.EDU,JCount,SC)
> 04/13 13:21:20 MPBSInitialize(COLOSSUS.MARS.ASU.EDU,SC)

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California