[Mauiusers] Maui dies

Jeffery Ludwig ludwig at che.udel.edu
Wed Feb 9 13:29:27 MST 2005


Sorry for the undescriptive subject line; I'm hoping the mail software
will thread this properly for future searches. This is a continuation
of a thread from last September regarding the scheduler "randomly"
dying.  See:

http://www.supercluster.org/pipermail/mauiusers/2004-September/001338.html

We've recently switched our scheduler from the OpenPBS default to Maui
on two separate Beowulf clusters I administer.  The first, running
SUSE 9.1, Linux 2.6.5-7.111.5-default and the OpenPBS server (as
provided by SUSE), has been working flawlessly for a week; the
pbs_server was not even shut down to make the change.

The second cluster has so far been riddled with problems. I think a
compare-and-contrast of the two setups may help isolate this bug, if it
is still unresolved.  The second cluster is running Red Hat 7.3, Linux
2.4.20-28.7.  It originally ran stock OpenPBS 2.3.16 with
maui-3.2.6p11 and now runs torque-1.2.0p0 with maui; problems occurred
with both setups.  It uses LAM 6.5.9/MPI 2 for the most part, while the
stable cluster uses LAM 7.0.6/MPI 2 C++/ROMIO.  It appears the cluster
using LAM 7.0.6 is accounting for usage in Fairshare properly, while
the one using 6.5.9 is not.

Key config difference between the setups, as noted previously in the 
thread, is that the problematic cluster is using standing reservations, 
while the other is not.

Based on our logs, maui seemed to run into trouble when it was
cancelling parallel jobs for wallclock violations. Maui (or rather
torque) is unable to actually kill jobs running under LAM-MPI, or it
accounts for their time improperly; they remain running even after maui
thinks they are gone.  Since issuing:

set queue dque resources_default.walltime = 2376:00:00

everything seems much more stable, save for fairshare values being off.
I have my fingers crossed right now. Sorry for the brain dump... if it
does crash again, I will run it through gdb to get a stack trace.  Both
are "production" machines, so I'm really not free to test too much...
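
For a sense of scale, the walltime default quoted above works out to 99
days, so in practice no job should reach the limit and trigger a
wallclock violation. A minimal Python sketch of that arithmetic follows.
It assumes the usual PBS-style [[HH:]MM:]SS walltime string; the helper
name walltime_to_seconds is made up for illustration and is not part of
torque or maui.

```python
def walltime_to_seconds(walltime: str) -> int:
    """Convert a PBS-style [[HH:]MM:]SS walltime string to seconds.

    Hypothetical helper for illustration only; assumes colon-separated
    non-negative integer fields with at most three components.
    """
    parts = [int(field) for field in walltime.split(":")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unexpected walltime format: {walltime!r}")
    seconds = 0
    for part in parts:
        # Each field to the left is worth 60x the one to its right.
        seconds = seconds * 60 + part
    return seconds


# The queue default from the qmgr line in this message.
default_walltime = "2376:00:00"
limit_seconds = walltime_to_seconds(default_walltime)
limit_days = limit_seconds / 86400  # 86400 seconds per day

print(f"{default_walltime} = {limit_seconds} s = {limit_days:g} days")
```

Note that a limit this large only hides the symptom: jobs are never
flagged for exceeding their walltime, so the kill path that seemed to
misbehave is simply not exercised.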

-- 
Jeffery Ludwig                                          (302)-831-2345
Research Assistant                            Dept of Chem Engineering
Center for Catalytic Science and Technology           235 Colburn Labs
University of Delaware                                Newark, DE 19716

