[Mauiusers] Maui dies
ludwig at che.udel.edu
Wed Feb 9 13:29:27 MST 2005
Sorry for the undescriptive subject line; I'm hoping the mail software
will thread this properly for future searches. This is a continuation
of a thread from last September regarding the scheduler "randomly"
dying.
We've recently upgraded our scheduling software from the OpenPBS default
to Maui on two separate Beowulf clusters I administer. The first,
running SUSE 9.1, Linux 2.6.5-7.111.5-default and the OpenPBS server (as
provided by SUSE), has been working flawlessly for a week; pbs_server
was not even shut down to make the change.
The second cluster has to this point been riddled with problems. I think
I can provide some information via a compare and contrast to help
isolate this bug if it is still unresolved. The second cluster is running
Red Hat 7.3, Linux 2.4.20-28.7. Originally this cluster was running
stock OpenPBS 2.3.16/maui-3.2.6p11; it is now running
torque-1.2.0p0/maui. Problems occurred with both setups. We are using
LAM 6.5.9/MPI 2 for the most part on this cluster, while the stable
cluster is using LAM 7.0.6/MPI 2 C++/ROMIO.
It appears the cluster using LAM 7.0.6 is accounting for usage in
Fairshare properly, while the cluster using 6.5.9 is not.
The key config difference between the setups, as noted previously in the
thread, is that the problematic cluster is using standing reservations,
while the other is not.
Based on our logs, maui seemed to have problems as it was dumping
parallel jobs due to a wallclock violation. It (or rather torque) is
unable to actually kill jobs running with LAM-MPI (or it accounts for
the time improperly); they remain running even after maui thinks they
are gone. Since issuing:
set queue dque resources_default.walltime = 2376:00:00
everything seems much more stable, save for the fairshare values being
off. I have my fingers crossed right now. Sorry for the brain dump... If
it does crash again, I will run it through gdb to get a stack trace. Both
are "production" machines, so I'm really not free to test too much...
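
For anyone wondering how generous that default walltime is, here is a
minimal Python sketch of the arithmetic. It assumes the three-field
HH:MM:SS form used in the qmgr command above; the helper name
walltime_to_seconds is purely illustrative and not part of maui or
torque.

```python
def walltime_to_seconds(walltime: str) -> int:
    """Convert an HH:MM:SS walltime string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Value taken from the qmgr command in this message.
default_walltime = "2376:00:00"
total_seconds = walltime_to_seconds(default_walltime)

print(total_seconds)          # 8553600
print(total_seconds / 86400)  # 99.0 (days)
```

In other words, the default is 99 days, long enough that jobs in dque
should effectively never hit a wallclock violation.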
Jeffery Ludwig (302)-831-2345
Research Assistant Dept of Chem Engineering
Center for Catalytic Science and Technology 235 Colburn Labs
University of Delaware Newark, DE 19716