[Mauiusers] Possible Memory Corruption in maui

Jason Williams jasonw at Jhu.edu
Wed Nov 9 07:26:19 MST 2011


Dr. Stephan Raub and Joerg Blank,

It looks like the binaries you have are stripped of their debug
symbols, which is going to make my idea of tracing the crash in the maui
code next to impossible.  However, I'm not entirely convinced this is a
maui bug, as the last calls before the libc calls go through the
torque library.
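
To illustrate that reading of the backtrace, here is a minimal Python
sketch (not part of maui or torque) that parses glibc-style backtrace
lines and reports the innermost frame outside libc. The regular
expression and the names parse_frames and innermost_non_libc are
illustrative assumptions; the sample lines are copied from the quoted
backtrace further down, with the wrapped address rejoined.

```python
import os
import re

# One glibc backtrace frame: module path, optional (symbol+offset), [address].
FRAME_RE = re.compile(
    r"^(?P<module>/[^\s(\[]+)"
    r"(?:\((?P<symbol>[^)+]*)(?:\+0x[0-9a-fA-F]+)?\))?"
    r"\[(?P<addr>0x[0-9a-fA-F]+)\]$"
)

# First six frames from the quoted backtrace (innermost first).
BACKTRACE = """\
/lib64/libc.so.6[0x3300672fae]
/lib64/libc.so.6(__libc_malloc+0x6e)[0x3300674cde]
/usr/local/torque/lib/libtorque.so.2(decode_DIS_replyCmd+0x266)[0x2ab278cb18e6]
/usr/local/torque/lib/libtorque.so.2(PBSD_rdrpy+0x80)[0x2ab278cb56d0]
/usr/local/torque/lib/libtorque.so.2(PBSD_status_get+0x26)[0x2ab278cb6786]
/usr/local/maui/sbin/maui[0x4d9e59]
"""


def parse_frames(text):
    """Return (module, symbol_or_None, address) per frame line, innermost first."""
    frames = []
    for line in text.splitlines():
        # Tolerate mail quoting ("> ") in front of a frame line.
        match = FRAME_RE.match(line.lstrip("> ").strip())
        if match:
            symbol = match.group("symbol") or None
            address = int(match.group("addr"), 16)
            frames.append((match.group("module"), symbol, address))
    return frames


def innermost_non_libc(frames):
    """Return the first frame whose module is not libc, or None."""
    for frame in frames:
        if not os.path.basename(frame[0]).startswith("libc."):
            return frame
    return None


if __name__ == "__main__":
    module, symbol, address = innermost_non_libc(parse_frames(BACKTRACE))
    print(f"{module}: {symbol} at {address:#x}")
```

Keep in mind that glibc reports the corruption at the point where
malloc detects it, which is not necessarily where the heap was damaged.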

Joerg:  What version of Torque do you have?

I think my next step here would be one or more of the following:
a) if you installed maui via RPM, install any matching -debuginfo RPM
and run with -d again to hopefully get some debug info in the
backtrace; b) run maui -d under gdb and see if you can get some useful
information there; c) if you compiled it from source, disable stripping
of the debug symbols and recompile to get more information in the
backtrace.  Without some useful information as to where in the maui
binary things are when the crash happens, I can't start looking to see
what happened.
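
Once debug symbols are available, the raw maui frame addresses can be
resolved to function names. The following is a minimal Python sketch
that collects the addresses of the frames inside the maui binary and
builds an addr2line command line from them. It assumes that addr2line
(GNU binutils) is installed, that the symbols belong to exactly the
build that produced the backtrace, and that the addresses can be passed
unmodified because the memory map shows maui loaded at 00400000. The
helper names are hypothetical.

```python
import re
import shlex

MAUI = "/usr/local/maui/sbin/maui"

# A few frames from the quoted backtrace (innermost first).
BACKTRACE = """\
/usr/local/torque/lib/libtorque.so.2(PBSD_status_get+0x26)[0x2ab278cb6786]
/usr/local/maui/sbin/maui[0x4d9e59]
/usr/local/maui/sbin/maui[0x48b8e4]
/usr/local/maui/sbin/maui[0x48b84f]
"""


def frame_addresses(binary, text):
    """Return the hex addresses of all frames that belong to `binary`."""
    pattern = re.compile(re.escape(binary) + r"\[(0x[0-9a-fA-F]+)\]")
    addresses = []
    for line in text.splitlines():
        match = pattern.fullmatch(line.strip())
        if match:
            addresses.append(match.group(1))
    return addresses


def addr2line_command(binary, addresses):
    """Build an addr2line call: -f prints function names, -e names the binary."""
    return " ".join(["addr2line", "-f", "-e", shlex.quote(binary)] + list(addresses))


if __name__ == "__main__":
    print(addr2line_command(MAUI, frame_addresses(MAUI, BACKTRACE)))
```

The script only prints the command; running it against a stripped
binary will yield "??" entries instead of function names.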

--
Jason

On 11/9/2011 4:38 AM, Dr. Stephan Raub wrote:
> Dear Jason Williams,
>
> thank you for your hint. Please find below the output of Maui run
> with the "-d" command line option (maui ran for about 5 minutes before
> it crashed):
>
> # /usr/local/maui/sbin/maui -d
> *** glibc detected *** /usr/local/maui/sbin/maui: malloc(): memory
> corruption: 0x00000000099243e0 ***
> ======= Backtrace: =========
> /lib64/libc.so.6[0x3300672fae]
> /lib64/libc.so.6(__libc_malloc+0x6e)[0x3300674cde]
> /usr/local/torque/lib/libtorque.so.2(decode_DIS_replyCmd+0x266)[0x2ab278cb18e6]
> /usr/local/torque/lib/libtorque.so.2(PBSD_rdrpy+0x80)[0x2ab278cb56d0]
> /usr/local/torque/lib/libtorque.so.2(PBSD_status_get+0x26)[0x2ab278cb6786]
> /usr/local/maui/sbin/maui[0x4d9e59]
> /usr/local/maui/sbin/maui[0x48b8e4]
> /usr/local/maui/sbin/maui[0x48b84f]
> /usr/local/maui/sbin/maui[0x4ce81c]
> /usr/local/maui/sbin/maui[0x4ce39e]
> /usr/local/maui/sbin/maui[0x4419eb]
> /usr/local/maui/sbin/maui[0x403608]
> /lib64/libc.so.6(__libc_start_main+0xf4)[0x330061d994]
> /usr/local/maui/sbin/maui[0x402cd9]
> ======= Memory map: ========
> 00400000-0054f000 r-xp 00000000 08:03 50266128 /usr/local/maui/sbin/maui
> 0074f000-00754000 rw-p 0014f000 08:03 50266128 /usr/local/maui/sbin/maui
> 00754000-02344000 rw-p 00754000 00:00 0
> 0984b000-188f1000 rw-p 0984b000 00:00 0 [heap]
> 3300200000-330021c000 r-xp 00000000 08:03 18186265 /lib64/ld-2.5.so
> 330041b000-330041c000 r--p 0001b000 08:03 18186265 /lib64/ld-2.5.so
> 330041c000-330041d000 rw-p 0001c000 08:03 18186265 /lib64/ld-2.5.so
> 3300600000-330074e000 r-xp 00000000 08:03 18186304 /lib64/libc-2.5.so
> 330074e000-330094d000 ---p 0014e000 08:03 18186304 /lib64/libc-2.5.so
> 330094d000-3300951000 r--p 0014d000 08:03 18186304 /lib64/libc-2.5.so
> 3300951000-3300952000 rw-p 00151000 08:03 18186304 /lib64/libc-2.5.so
> 3300952000-3300957000 rw-p 3300952000 00:00 0
> 3300a00000-3300a02000 r-xp 00000000 08:03 18186457 /lib64/libdl-2.5.so
> 3300a02000-3300c02000 ---p 00002000 08:03 18186457 /lib64/libdl-2.5.so
> 3300c02000-3300c03000 r--p 00002000 08:03 18186457 /lib64/libdl-2.5.so
> 3300c03000-3300c04000 rw-p 00003000 08:03 18186457 /lib64/libdl-2.5.so
> 3300e00000-3300e82000 r-xp 00000000 08:03 18186543 /lib64/libm-2.5.so
> 3300e82000-3301081000 ---p 00082000 08:03 18186543 /lib64/libm-2.5.so
> 3301081000-3301082000 r--p 00081000 08:03 18186543 /lib64/libm-2.5.so
> 3301082000-3301083000 rw-p 00082000 08:03 18186543 /lib64/libm-2.5.so
> 3303a00000-3303a0d000 r-xp 00000000 08:03 18186545 /lib64/libgcc_s-4.1.2-20080825.so.1
> 3303a0d000-3303c0d000 ---p 0000d000 08:03 18186545 /lib64/libgcc_s-4.1.2-20080825.so.1
> 3303c0d000-3303c0e000 rw-p 0000d000 08:03 18186545 /lib64/libgcc_s-4.1.2-20080825.so.1
> 3304a00000-3304a15000 r-xp 00000000 08:03 18186491 /lib64/libselinux.so.1
> 3304a15000-3304c15000 ---p 00015000 08:03 18186491 /lib64/libselinux.so.1
> 3304c15000-3304c17000 rw-p 00015000 08:03 18186491 /lib64/libselinux.so.1
> 3304c17000-3304c18000 rw-p 3304c17000 00:00 0
> 3304e00000-3304e3b000 r-xp 00000000 08:03 18186479 /lib64/libsepol.so.1
> 3304e3b000-330503b000 ---p 0003b000 08:03 18186479 /lib64/libsepol.so.1
> 330503b000-330503c000 rw-p 0003b000 08:03 18186479 /lib64/libsepol.so.1
> 330503c000-3305046000 rw-p 330503c000 00:00 0
> 3305e00000-3305e02000 r-xp 00000000 08:03 18186469 /lib64/libkeyutils-1.3.so
> 3305e02000-3306001000 ---p 00002000 08:03 18186469 /lib64/libkeyutils-1.3.so
> 3306001000-3306002000 rw-p 00001000 08:03 18186469 /lib64/libkeyutils-1.3.so
> 3306200000-3306211000 r-xp 00000000 08:03 18186474 /lib64/libresolv-2.5.so
> 3306211000-3306411000 ---p 00011000 08:03 18186474 /lib64/libresolv-2.5.so
> 3306411000-3306412000 r--p 00Aborted
>
> Thank you for your efforts.
>
> Stephan
> --
> ---------------------------------------------------------
> | | Dr. rer. nat. Stephan Raub
> | | Dipl. Chem.
> | | High-Performance-Computing
> | | Zentrum für Informations- und Medientechnologie
> | | Heinrich-Heine-Universität Düsseldorf
> | | Universitätsstr. 1 / Raum 25.41.O2.25-2
> | | 40225 Düsseldorf / Germany
> | |
> | | Tel: +49-211-811-3911
> | | Fax: +49-211-811-2539
> ---------------------------------------------------------
>
> Important Note: This e-mail may contain trade secrets or privileged,
> undisclosed or otherwise confidential information. If you have received this
> e-mail in error, you are hereby notified that any review, copying or
> distribution of it is strictly prohibited. Please inform us immediately and
> destroy the original transmittal. Thank you for your cooperation.
>
>> -----Original Message-----
>> From: mauiusers-bounces at supercluster.org
>> [mailto:mauiusers-bounces at supercluster.org] On Behalf Of Jason Williams
>> Sent: Tuesday, 8 November 2011 23:50
>> To: mauiusers at supercluster.org
>> Subject: Re: [Mauiusers] Possible Memory Corruption in maui
>>
>> Dr Stephan Raub,
>>
>> Maui does have some very odd "memory management" in it that has a
>> tendency to cause these types of crashes when run in high volume
>> situations without some tweaks and/or concessions.  I've tracked down,
>> and I think fixed, one in the latest svn trunk, but 3.3.1 should
>> already have that fix in it.
>>
>> Have you tried running maui from the command line with the -d option
>> and capturing the memory corruption message and backtrace that come
>> out of it? Your original email has the strace, but it cuts off some
>> of the backtrace.  I might be able to see where in the code it's
>> having problems if I can get the full backtrace.
>>
>>
>> --
>> Jason Williams
>> Systems Engineer
>> Homewood High Performance Cluster
>> Johns Hopkins University
>>
>> On 11/8/2011 12:09 PM, Dr. Stephan Raub wrote:
>>> Dear Mr. van der Vlies
>>>
>>> Currently we have 6095 jobs queued and 93 jobs running. Among these,
>>> we have some large job arrays (1000 and 4000 items per array).
>>>
>>> Best regards.
>>>
>>>> -----Original Message-----
>>>> From: Bas van der Vlies [mailto:basv at sara.nl]
>>>> Sent: Tuesday, 8 November 2011 17:10
>>>> To: Dr. Stephan Raub
>>>> Subject: Re: [Mauiusers] Possible Memory Corruption in maui
>>>>
>>>> On 08-11-11 16:40, Dr. Stephan Raub wrote:
>>>>> Dear fellow maui users,
>>>>>
>>>>> we are running Maui 3.3.1 with torque 2.3.7 under RHEL5.5
>>>>> (2.6.8-194.26.1.el1) on a cluster with roughly 600 cores.
>>>>>
>>>>> We experienced a sudden death of the maui scheduler with no message
>>>>> in the logs. We could not figure out a reason, so we attached an
>>>>> "strace" to the maui process (as long as it was "still alive") and
>>>>> we got:
>>>>>
>>>> Dear Dr. Stephan Raub,
>>>>
>>>> just a question: How many jobs are in the queue?
>>>>
>>>> regards
>>>>
>>>>
>>>> --
>>>> ********************************************************************
>>>> *  Bas van der Vlies                    e-mail: basv at sara.nl       *
>>>> *  SARA - Academic Computing Services   Amsterdam, The Netherlands *
>>>> ********************************************************************
>>>
>>> _______________________________________________
>>> mauiusers mailing list
>>> mauiusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/mauiusers


