[Mauiusers] Maui is unexpectedly down
Lennart.Karlsson at nsc.liu.se
Fri Aug 19 05:04:51 MDT 2005
Dear Jung Oh,
You answered a letter by Gordon with the following lines, August 16th:
> Thank you for your help.
> As you wrote down, I changed some configuration in the maui.cfg file as follows.
> Since then Maui has worked fine.
> RMCFG[head node] TIMEOUT=30
> JOBAGGREGATIONTIME 00:00:10
> RMPOLLINTERVAL 00:02:30
> LOGLEVEL 9 (advised by Mr. Garrick)
> But I cannot find any explanation or guidelines of
> 'RMCFG TIMEOUT' and 'JOBAGGREGATIONTIME' variables
> in admi~.pdf file nor web site.
> In my opinion, the most important variable is 'RMPOLLINTERVAL'.
> I really appreciate your help.
I also had great help from Gordon's configuration. (Thank you!)
Maui, at least in versions 3.2.6p11 and 3.2.6p13, handles Torque timeouts
badly. We are using preemption, and when Maui has ordered Torque to
requeue a preemptee, Maui immediately afterwards orders Torque to start the
preemptor. The latter call to Torque times out and I can find the log line
ERROR: cannot get node info: NULL
in the Maui log. When running version 3.2.6p13, Maui crashes with the
mentioned line as the last log line at LOGLEVEL 9. When running version p11,
Maui does not crash, but it cannot run the preemptor without problems,
such as first putting the preemptor on HOLD.
I am now happy to see that the line
in the Maui configuration file seems to help me out. (Probably 30 would
be sufficient in most cases, but I want a margin here.)
I read that CRI is working on a SEGFAULT fix and hope that this fix also
solves the 3.2.6p13 crashes.
TIMEOUT is explained in web page
It says that the default TIMEOUT is 15 seconds, which is too low for
the Maui-Torque combination.
JOBAGGREGATIONTIME is explained in web page
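For reference, the settings discussed in this thread could be combined in
maui.cfg roughly as in the sketch below. The values are the ones Jung Oh
quoted above. The resource manager name "base" and TYPE=PBS are only
assumptions for illustration (use whatever your existing RMCFG line says),
and the TIMEOUT value should be adjusted to your own site; as noted above,
I use a larger value than 30 myself to have a margin.
    # maui.cfg sketch; "base" and TYPE=PBS are assumed, not taken from this thread
    RMCFG[base]         TYPE=PBS TIMEOUT=30   # seconds; the default of 15 is too low with Torque
    RMPOLLINTERVAL      00:02:30
    JOBAGGREGATIONTIME  00:00:10
    LOGLEVEL            9                     # verbose; useful while debugging the timeouts
Maui has to be restarted for changes in maui.cfg to take effect.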
-- Lennart Karlsson <Lennart.Karlsson at nsc.liu.se>
National Supercomputer Centre in Linkoping, Sweden
+46 706 49 55 35
+46 13 28 26 24