[Mauiusers] maui & old gcc optimizer bug.
David B Jackson
jacksond at clusterresources.com
Sat Nov 19 22:25:02 MST 2005
Chris,
In newer versions of TORQUE (2.0.1+), the momctl command can be used to
identify and report prolog timeouts, which may cause this failure. To
see this output, use 'momctl -d 3'.
Dave
> On Wed, Nov 16, 2005 at 09:13:59AM -0500, Chris Johnson alleged:
>> Hi all,
>>
>> I have a little annoyance here which is driving me up the wall. I
>> have two mini Maui clusters with TORQUE running. The well-behaved one
>> is on CentOS 4.1 with Opteron architecture. The less-than-ideal one
>> is on FC2 with P-III hardware.
>>
>> The one on the Opterons is running terrifically.
>>
>> The one on the P-IIIs gives me maui.log errors like this:
>>
>> (rc: 15041 errmsg: 'Execution server rejected request MSG=send failed,
>> STARTING' hostlist: 'node15')
>
> The "Execution server" in this case is node15. Off the top of my head,
> I'd say the most likely cause is a long-running prologue.
>
> As Chris mentioned, this probably has nothing to do with Maui. What
> version of TORQUE?
>
> In 1.2.0p1 we added the $jobstartblocktime MOM config parameter. It
> specifies the number of seconds pbs_mom will block on the initial
> attempt to start the job. After jobstartblocktime seconds, pbs_mom
> returns "ask me again later" back to pbs_server. Unfortunately,
> pbs_server is also blocked during that time. The default is 5 seconds.
>
> I set mine to 0 for fully non-blocking job startup. IMHO, 0 is the
> ideal value but I don't think it is widely tested outside of my cluster.
> You can try higher or lower values on-the-fly with
> 'momctl -h node15 -q jobstartblocktime=X'.
>
>
> --
> Garrick Staples, Linux/HPCC Administrator
> University of Southern California
> _______________________________________________
> mauiusers mailing list
> mauiusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/mauiusers
>
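
The two momctl suggestions in this thread (changing jobstartblocktime on
the fly, then checking the MOM diagnostics for prolog timeouts) could be
combined roughly as in the sketch below. The function name momctl_cmds is
hypothetical and not part of TORQUE. It only prints the commands instead of
running them, and it assumes that momctl accepts -h together with -d in a
single call.

```shell
#!/bin/sh
# Hypothetical helper (not part of TORQUE): prints the momctl commands
# discussed in this thread for one node, without executing them.
momctl_cmds() {
    node="$1"   # MOM host to query, e.g. node15
    secs="$2"   # trial value for jobstartblocktime, in seconds

    # Set jobstartblocktime on the fly (command form quoted in the thread).
    printf 'momctl -h %s -q jobstartblocktime=%s\n' "$node" "$secs"

    # Request level-3 diagnostics, which the thread says can report
    # prolog timeouts in TORQUE 2.0.1+.
    # Assumption: -h and -d may be combined in one invocation.
    printf 'momctl -h %s -d 3\n' "$node"
}

# Print the commands for node15 with fully non-blocking startup (0 seconds).
momctl_cmds node15 0
```

On a host where momctl is installed, the printed lines can be reviewed and
then run by hand. A value set this way is a trial setting; to keep it, the
$jobstartblocktime parameter would presumably go in the MOM config file.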