[Mauiusers] Maui is unexpectedly down

Lennart Karlsson Lennart.Karlsson at nsc.liu.se
Fri Aug 19 05:04:51 MDT 2005


Dear Jung Oh,

On August 16th you replied to Gordon's message with the following lines:
> Thank you for your help.
> As you suggested, I changed some settings in the maui.cfg file as follows.
> Since then Maui has been working fine.
> 
> _______________________________
> RMCFG[head node] TIMEOUT=30
> JOBAGGREGATIONTIME    00:00:10
> RMPOLLINTERVAL        00:02:30
> 
> LOGLEVEL                   9 (advised by Mr. Garrick)
> ------------------------------------------
> 
> But I cannot find any explanation of or guidelines for the
> 'RMCFG[] TIMEOUT' and 'JOBAGGREGATIONTIME' variables
> in the admi~.pdf file or on the web site.
> 
> In my opinion, the most important variable is 'RMPOLLINTERVAL'.
> 
> I really appreciate your help.


I also had great help from Gordon's configuration. (Thank you!)

Maui, at least in versions 3.2.6p11 and 3.2.6p13, handles Torque timeouts
badly. We are using preemption, and when Maui has ordered Torque to
requeue a preemptee, it immediately afterwards orders Torque to start the
preemptor. The latter call to Torque times out, and I find the log line
	ERROR:    cannot get node info: NULL
in the Maui log. When running version 3.2.6p13, Maui crashes with the
mentioned line as the last log line at LOGLEVEL 9. Version 3.2.6p11
does not crash, but it cannot run the preemptor without problems,
such as first putting the preemptor on HOLD.

I am now happy to see that the line
	RMCFG[base]             TIMEOUT=90
in the Maui configuration file seems to solve the problem for me. (A value
of 30 would probably be sufficient in most cases, but I want a margin here.)

I have read that CRI is working on a fix for a segfault, and I hope that this
fix also solves the 3.2.6p13 crashes.

TIMEOUT is explained on the web page
http://www.clusterresources.com/products/maui/docs/13.2rmconfiguration.shtml

It says that the default TIMEOUT is 15 seconds, which is too low for
the Maui-Torque combination.

JOBAGGREGATIONTIME is explained on the web page
http://www.clusterresources.com/products/maui/docs/a.fparameters.shtml
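
For reference, the settings mentioned in this thread could be combined in
maui.cfg roughly as follows. This is only a sketch: the values are the ones
quoted above (TIMEOUT=90 from my file, the rest from your message), not tuned
recommendations, and "base" must match the resource manager name used in
your own configuration.

	# Resource manager timeout, raised well above the documented 15 s default
	RMCFG[base]             TIMEOUT=90
	# Polling and job aggregation values as quoted from Jung Oh's message
	RMPOLLINTERVAL          00:02:30
	JOBAGGREGATIONTIME      00:00:10
	# Verbose logging, useful while chasing the timeout problem
	LOGLEVEL                9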

Thanks,
-- Lennart Karlsson <Lennart.Karlsson at nsc.liu.se>
   National Supercomputer Centre in Linkoping, Sweden
   http://www.nsc.liu.se
   +46 706 49 55 35
   +46 13 28 26 24



