[Mauiusers] Possible Memory Corruption in maui

Dr. Stephan Raub raub at uni-duesseldorf.de
Wed Nov 9 12:39:29 MST 2011


Hello,


> However, I'm not entirely convinced this is a
> maui bug as the last calls before the libc calls are ones through the
> torque library.

I totally agree. We dug into the maui code and found that the error
occurs during the call to "pbs_statnode()" (MPBSI.c, line 1268). The
"memory corruption" seems to be raised not in maui itself but in the
torque function PBSD_status_get() (which is called by PBSD_status()) in
PBSD_status.c. Currently, we suspect an error in building the *next
entries of the (struct batch_status) list.
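
To illustrate what we mean by building the next entries, here is a minimal
sketch in C of how such a singly linked reply list can be assembled
defensively. It is not the torque code: struct status_node, status_append(),
status_count() and status_free() are hypothetical names made up for this
example, and only the next-pointer layout is meant to resemble the list we
suspect.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one status reply entry.
 * This is NOT the real struct batch_status from the torque headers. */
struct status_node {
    struct status_node *next;  /* next entry, NULL at the end of the list */
    char *name;                /* heap-allocated copy of the object name */
};

/* Append a copy of `name` to the list described by *head and *tail.
 * Returns the new node, or NULL if an allocation failed (list unchanged). */
static struct status_node *status_append(struct status_node **head,
                                         struct status_node **tail,
                                         const char *name)
{
    /* calloc zeroes the node, so `next` starts out as NULL. */
    struct status_node *node = calloc(1, sizeof *node);
    if (node == NULL)
        return NULL;

    size_t len = strlen(name) + 1;   /* +1 for the terminating NUL byte */
    node->name = malloc(len);
    if (node->name == NULL) {
        free(node);
        return NULL;
    }
    memcpy(node->name, name, len);

    /* Link the node in only after it is fully initialized. */
    if (*tail != NULL)
        (*tail)->next = node;
    else
        *head = node;
    *tail = node;
    return node;
}

/* Count the entries by following the next pointers. */
static size_t status_count(const struct status_node *head)
{
    size_t n = 0;
    for (; head != NULL; head = head->next)
        n++;
    return n;
}

/* Release every node and the name it owns. */
static void status_free(struct status_node *head)
{
    while (head != NULL) {
        struct status_node *next = head->next;  /* save before freeing */
        free(head->name);
        free(head);
        head = next;
    }
}
```

A caller would start with head and tail set to NULL, call status_append()
once per reply entry, and finally release everything with status_free().
The point of the sketch is that every buffer is sized from the data copied
into it and every next pointer is initialized; a mistake in either place can
damage the heap and only show up at a later, unrelated malloc() call.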

It seems I have to apologize for bothering the maui list with this problem.
;-) Thanks to all of you for your comments and suggestions. They have
eventually led us in the right direction.

Best regards

Stephan
--
---------------------------------------------------------
| | Dr. rer. nat. Stephan Raub
| | Dipl. Chem.
| | High-Performance-Computing
| | Zentrum für Informations- und Medientechnologie 
| | Heinrich-Heine-Universität Düsseldorf
| | Universitätsstr. 1 / Raum 25.41.O2.25-2
| | 40225 Düsseldorf / Germany
| |
| | Tel: +49-211-811-3911
| | Fax: +49-211-811-2539
---------------------------------------------------------

Important Note: This e-mail may contain trade secrets or privileged,
undisclosed or otherwise confidential information. If you have received this
e-mail in error, you are hereby notified that any review, copying or
distribution of it is strictly prohibited. Please inform us immediately and
destroy the original transmittal. Thank you for your cooperation.


> -----Original Message-----
> From: mauiusers-bounces at supercluster.org [mailto:mauiusers-
> bounces at supercluster.org] On Behalf Of Jason Williams
> Sent: Wednesday, November 9, 2011 15:26
> To: mauiusers at supercluster.org
> Subject: Re: [Mauiusers] Possible Memory Corruption in maui
> 
> Dr. Stephan Raub and Joerg Blank,
> 
> It looks like the binaries you all have are stripped of their debug
> symbols which is going to make my idea of tracing the crash in the maui
> code next to impossible.  However, I'm not entirely convinced this is a
> maui bug as the last calls before the libc calls are ones through the
> torque library.
> 
> Joerg:  What version of Torque do you have?
> 
> I think my next step here would be to either a) load any -debuginfo rpm
> if you installed it via RPM and try running with the -d again to
> hopefully get some debug info in the back trace b) try running maui -d
> through gdb and see if you can get some useful information there and/or
> c) if you compiled it from source, disable stripping the debug symbols
> and recompile it to try to get some more information in the backtrace.
> Without some useful information as to where in the maui binary things
> are when the crash happens, I can't start looking to see what happened.
> 
> --
> Jason
> 
> On 11/9/2011 4:38 AM, Dr. Stephan Raub wrote:
> > Dear Jason Williams,
> >
> > thank you for your hint. Please, find below the result of our Maui
> > running with the "-d" command line option (maui was running about 5
> > minutes before it crashed):
> >
> > # /usr/local/maui/sbin/maui -d
> > *** glibc detected *** /usr/local/maui/sbin/maui: malloc(): memory corruption: 0x00000000099243e0 ***
> > ======= Backtrace: =========
> > /lib64/libc.so.6[0x3300672fae]
> > /lib64/libc.so.6(__libc_malloc+0x6e)[0x3300674cde]
> > /usr/local/torque/lib/libtorque.so.2(decode_DIS_replyCmd+0x266)[0x2ab278cb18e6]
> > /usr/local/torque/lib/libtorque.so.2(PBSD_rdrpy+0x80)[0x2ab278cb56d0]
> > /usr/local/torque/lib/libtorque.so.2(PBSD_status_get+0x26)[0x2ab278cb6786]
> > /usr/local/maui/sbin/maui[0x4d9e59]
> > /usr/local/maui/sbin/maui[0x48b8e4]
> > /usr/local/maui/sbin/maui[0x48b84f]
> > /usr/local/maui/sbin/maui[0x4ce81c]
> > /usr/local/maui/sbin/maui[0x4ce39e]
> > /usr/local/maui/sbin/maui[0x4419eb]
> > /usr/local/maui/sbin/maui[0x403608]
> > /lib64/libc.so.6(__libc_start_main+0xf4)[0x330061d994]
> > /usr/local/maui/sbin/maui[0x402cd9]
> > ======= Memory map: ========
> > 00400000-0054f000 r-xp 00000000 08:03 50266128 /usr/local/maui/sbin/maui
> > 0074f000-00754000 rw-p 0014f000 08:03 50266128 /usr/local/maui/sbin/maui
> > 00754000-02344000 rw-p 00754000 00:00 0
> > 0984b000-188f1000 rw-p 0984b000 00:00 0 [heap]
> > 3300200000-330021c000 r-xp 00000000 08:03 18186265 /lib64/ld-2.5.so
> > 330041b000-330041c000 r--p 0001b000 08:03 18186265 /lib64/ld-2.5.so
> > 330041c000-330041d000 rw-p 0001c000 08:03 18186265 /lib64/ld-2.5.so
> > 3300600000-330074e000 r-xp 00000000 08:03 18186304 /lib64/libc-2.5.so
> > 330074e000-330094d000 ---p 0014e000 08:03 18186304 /lib64/libc-2.5.so
> > 330094d000-3300951000 r--p 0014d000 08:03 18186304 /lib64/libc-2.5.so
> > 3300951000-3300952000 rw-p 00151000 08:03 18186304 /lib64/libc-2.5.so
> > 3300952000-3300957000 rw-p 3300952000 00:00 0
> > 3300a00000-3300a02000 r-xp 00000000 08:03 18186457 /lib64/libdl-2.5.so
> > 3300a02000-3300c02000 ---p 00002000 08:03 18186457 /lib64/libdl-2.5.so
> > 3300c02000-3300c03000 r--p 00002000 08:03 18186457 /lib64/libdl-2.5.so
> > 3300c03000-3300c04000 rw-p 00003000 08:03 18186457 /lib64/libdl-2.5.so
> > 3300e00000-3300e82000 r-xp 00000000 08:03 18186543 /lib64/libm-2.5.so
> > 3300e82000-3301081000 ---p 00082000 08:03 18186543 /lib64/libm-2.5.so
> > 3301081000-3301082000 r--p 00081000 08:03 18186543 /lib64/libm-2.5.so
> > 3301082000-3301083000 rw-p 00082000 08:03 18186543 /lib64/libm-2.5.so
> > 3303a00000-3303a0d000 r-xp 00000000 08:03 18186545 /lib64/libgcc_s-4.1.2-20080825.so.1
> > 3303a0d000-3303c0d000 ---p 0000d000 08:03 18186545 /lib64/libgcc_s-4.1.2-20080825.so.1
> > 3303c0d000-3303c0e000 rw-p 0000d000 08:03 18186545 /lib64/libgcc_s-4.1.2-20080825.so.1
> > 3304a00000-3304a15000 r-xp 00000000 08:03 18186491 /lib64/libselinux.so.1
> > 3304a15000-3304c15000 ---p 00015000 08:03 18186491 /lib64/libselinux.so.1
> > 3304c15000-3304c17000 rw-p 00015000 08:03 18186491 /lib64/libselinux.so.1
> > 3304c17000-3304c18000 rw-p 3304c17000 00:00 0
> > 3304e00000-3304e3b000 r-xp 00000000 08:03 18186479 /lib64/libsepol.so.1
> > 3304e3b000-330503b000 ---p 0003b000 08:03 18186479 /lib64/libsepol.so.1
> > 330503b000-330503c000 rw-p 0003b000 08:03 18186479 /lib64/libsepol.so.1
> > 330503c000-3305046000 rw-p 330503c000 00:00 0
> > 3305e00000-3305e02000 r-xp 00000000 08:03 18186469 /lib64/libkeyutils-1.3.so
> > 3305e02000-3306001000 ---p 00002000 08:03 18186469 /lib64/libkeyutils-1.3.so
> > 3306001000-3306002000 rw-p 00001000 08:03 18186469 /lib64/libkeyutils-1.3.so
> > 3306200000-3306211000 r-xp 00000000 08:03 18186474 /lib64/libresolv-2.5.so
> > 3306211000-3306411000 ---p 00011000 08:03 18186474 /lib64/libresolv-2.5.so
> > 3306411000-3306412000 r--p 00Aborted
> >
> > Thank you for your efforts.
> >
> > Stephan
> >
> >> -----Original Message-----
> >> From: mauiusers-bounces at supercluster.org [mailto:mauiusers-
> >> bounces at supercluster.org] On Behalf Of Jason Williams
> >> Sent: Tuesday, November 8, 2011 23:50
> >> To: mauiusers at supercluster.org
> >> Subject: Re: [Mauiusers] Possible Memory Corruption in maui
> >>
> >> Dr Stephan Raub,
> >>
> >> Maui does have some very odd "memory management" in it that has a
> >> tendency to cause these types of crashes when run in high volume
> >> situations without some tweaks and/or concessions.  I've tracked
> >> down, and I think fixed, one in the latest svn trunk, but 3.3.1
> >> should already have that fix in it.
> >>
> >> Have you tried running maui from the command line with the -d
> >> option and capturing the memory corruption message and backtrace
> >> that come out of it? Your original email has the strace, but it
> >> cuts off some of the backtrace.  I might be able to see where in
> >> the code it's having problems if I can get the full backtrace.
> >>
> >>
> >> --
> >> Jason Williams
> >> Systems Engineer
> >> Homewood High Performance Cluster
> >> Johns Hopkins University
> >>
> >> On 11/8/2011 12:09 PM, Dr. Stephan Raub wrote:
> >>> Dear Mr. van der Vlies
> >>>
> >>> Currently we have 6095 jobs queued and 93 jobs running. Among
> >>> these, we have some large job arrays (1000 and 4000 items per
> >>> array).
> >>>
> >>> Best regards.
> >>>
> >>>> -----Original Message-----
> >>>> From: Bas van der Vlies [mailto:basv at sara.nl]
> >>>> Sent: Tuesday, November 8, 2011 17:10
> >>>> To: Dr. Stephan Raub
> >>>> Subject: Re: [Mauiusers] Possible Memory Corruption in maui
> >>>>
> >>>> On 08-11-11 16:40, Dr. Stephan Raub wrote:
> >>>>> Dear fellow maui users,
> >>>>>
> >>>>> we are running Maui 3.3.1 with torque 2.3.7 under RHEL5.5
> >>>>> (2.6.8-194.26.1.el1) on a cluster with roughly 600 cores.
> >>>>>
> >>>>> We experienced a sudden death of the maui scheduler with no
> >>>>> message in the logs. We could not figure out a reason, so we
> >>>>> attached an "strace" to the maui process (as long as it was
> >>>>> "still alive") and we got:
> >>>>>
> >>>>>
> >>>> Dear Dr. Stephan Raub,
> >>>>
> >>>> just a question: How many jobs are in the queue?
> >>>>
> >>>> regards
> >>>>
> >>>>
> >>>> --
> >>>>
> >>>> ********************************************************************
> >>>> *  Bas van der Vlies                    e-mail: basv at sara.nl     *
> >>>> *  SARA - Academic Computing Services   Amsterdam, The Netherlands *
> >>>> ********************************************************************
> >>>
> >>> _______________________________________________
> >>> mauiusers mailing list
> >>> mauiusers at supercluster.org
> >>> http://www.supercluster.org/mailman/listinfo/mauiusers



