[Mauiusers] maui & old gcc optimizer bug.

David B Jackson jacksond at clusterresources.com
Sat Nov 19 22:25:02 MST 2005


Chris,

  In newer versions of TORQUE (2.0.1+), the momctl command can be used to
identify and report prolog timeouts, which may result in this failure.  To
see this output, use 'momctl -d 3'.

Dave


> On Wed, Nov 16, 2005 at 09:13:59AM -0500, Chris Johnson alleged:
>>      Hi all,
>>
>>      Have a little annoyance here which is driving me up the wall.  I
>> have two mini Maui clusters with TORQUE running.  The well-behaved one
>> is on CentOS 4.1 with Opteron architecture.  The less-than-ideal one
>> is on FC2 with P-III hardware.
>>
>>      The one on the opterons is running terrific.
>>
>>      The one on the P-IIIs gives me maui.log errors like this:
>>
>> (rc: 15041  errmsg: 'Execution server rejected request MSG=send failed,
>> STARTING'  hostlist: 'node15')
>
> The "Execution server" in this case is node15.  Off the top of my head,
> I'd say the most likely cause is a long-running prologue.
>
> As Chris mentioned, this probably has nothing to do with Maui.  What
> version of TORQUE?
>
> In 1.2.0p1 we added the $jobstartblocktime MOM config parameter.  It
> specifies the number of seconds pbs_mom will block on the initial
> attempt to start the job.  After jobstartblocktime seconds, pbs_mom
> returns "ask me again later" back to pbs_server.  Unfortunately,
> pbs_server is also blocked during that time.  The default is 5 seconds.
>
> I set mine to 0 for fully non-blocking job startup.  IMHO, 0 is the
> ideal value but I don't think it is widely tested outside of my cluster.
> You can try higher or lower values on-the-fly with
> 'momctl -h node15 -q jobstartblocktime=X'.
>
>
> --
> Garrick Staples, Linux/HPCC Administrator
> University of Southern California
>
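
For anyone wanting to try several jobstartblocktime values as suggested
above, here is a minimal sketch. It only prints the momctl commands and
does not run them. The helper name blocktime_cmd and the sample values
are hypothetical. The node name and command form are taken from the
quoted message.

```shell
#!/bin/sh
# Hypothetical helper: builds the on-the-fly momctl command quoted above
# for a given node and block time in seconds. It only prints the command,
# so nothing is changed on the MOM until you run the output yourself.
blocktime_cmd() {
    node="$1"
    seconds="$2"
    printf "momctl -h %s -q jobstartblocktime=%s\n" "$node" "$seconds"
}

# Sample values are arbitrary; 5 is the default mentioned in the thread
# and 0 is the fully non-blocking setting.
for s in 0 2 5 10; do
    blocktime_cmd node15 "$s"
done
```

A value that works well could then presumably be made persistent as a
"$jobstartblocktime" line in the MOM config file, though the exact file
location depends on the install and is not given in this thread.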


