[Mauiusers] maui & old gcc optimizer bug.

Garrick Staples garrick at usc.edu
Sat Nov 19 14:20:09 MST 2005


On Wed, Nov 16, 2005 at 09:13:59AM -0500, Chris Johnson alleged:
>      Hi all,
> 
>      Have a little annoyance here which is driving me up the wall.  I
> have two mini maui clusters with torque running.  The well behaved one
> is on CentOS 4.1 with opteron architecture.  The less than ideal one
> is on FC2 with P-III hardware.
> 
>      The one on the opterons is running terrific.
> 
>      The one on the P-III's gives me maui.log errors like this
> 
> (rc: 15041  errmsg: 'Execution server rejected request MSG=send failed, 
> STARTING'  hostlist: 'node15')

The "Execution server" in this case is node15.  Off the top of my head,
I'd say the most likely cause is a long-running prologue.  

As Chris mentioned, this probably has nothing to do with Maui.  What
version of TORQUE?

In TORQUE 1.2.0p1 we added the $jobstartblocktime MOM config parameter.
It specifies the number of seconds pbs_mom will block on the initial
attempt to start the job.  If the job hasn't started after
jobstartblocktime seconds, pbs_mom returns "ask me again later" to
pbs_server.  Unfortunately, pbs_server is also blocked while pbs_mom
waits.  The default is 5 seconds.
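
To make the block-then-defer behavior concrete, here is a toy model in
Python.  This is not TORQUE code; start_job, the prologue callable, and
the return strings are all made up for illustration.  It only shows the
idea: wait up to block_time seconds for the prologue, then tell the
caller to ask again later.

```python
import threading
import time


def start_job(prologue, block_time):
    """Toy model of a blocking job start (hypothetical, not TORQUE source).

    Runs `prologue` in a background thread and waits at most
    `block_time` seconds for it to finish.
    """
    done = threading.Event()

    def run():
        prologue()
        done.set()

    threading.Thread(target=run, daemon=True).start()

    # The caller is stuck here for up to block_time seconds, which is
    # the same reason the server side stalls in the real system.
    if done.wait(timeout=block_time):
        return "STARTED"
    return "RETRY_LATER"  # stands in for "ask me again later"


if __name__ == "__main__":
    print(start_job(lambda: None, 1.0))              # fast prologue
    print(start_job(lambda: time.sleep(0.5), 0.05))  # slow prologue
```

With block_time set to 0 the caller never waits, so a prologue that
takes any real time gets the "ask again later" answer on the first try.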

I set mine to 0 for fully non-blocking job startup.  IMHO, 0 is the
ideal value, but I don't think it has been widely tested outside of my
cluster.  You can try higher or lower values on the fly with
'momctl -h node15 -q jobstartblocktime=X'.
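
If you want the value set every time pbs_mom starts, the same parameter
can go in the MOM config file.  A minimal fragment, assuming the usual
mom_priv/config file under your TORQUE spool directory (the exact path
depends on how TORQUE was built):

```
# mom_priv/config on node15 (path is an assumption; check your install)
# 0 = do not block at all on job start; the default is 5 seconds
$jobstartblocktime 0
```

pbs_mom reads this file at startup, so restart it on the node after
editing.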


-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California