[Mauiusers] Possible Memory Corruption in maui

Jason Williams jasonw at Jhu.edu
Wed Nov 9 13:25:58 MST 2011


Dr. Raub,

If you find a solution to the problem, please let me know and/or post 
back to the maui list.  I don't really monitor the torque lists, since 
they're a bit higher volume.  I'd also be curious to know whether they 
try to pass the issue back to the maui side, or whether they don't 
respond at all.  I'm just glad you now seem to be headed toward a 
solution.

--
Jason

On 11/9/2011 2:39 PM, Dr. Stephan Raub wrote:
> Hello,
>
>
>> However, I'm not entirely convinced this is a
>> maui bug as the last calls before the libc calls are ones through the
>> torque library.
> I totally agree. We dug into the maui code and found that the error
> occurs during the call to "pbs_statnode()" (MPBSI.c, line 1268). The
> "memory corruption" seems to be raised not in maui itself but in the
> torque function PBSD_status_get() (which is called by PBSD_status()) in
> PBSD_status.c. Currently, we suspect an error in building the "next"
> entries (struct batch_status *next) of this list.
>
> It seems I have to apologize for bothering the maui list with this problem.
> ;-) Thank you all for your comments and suggestions. They eventually
> led us in the right direction.
>
> Best regards
>
> Stephan
> --
> ---------------------------------------------------------
> | | Dr. rer. nat. Stephan Raub
> | | Dipl. Chem.
> | | High-Performance-Computing
> | | Zentrum für Informations- und Medientechnologie
> | | Heinrich-Heine-Universität Düsseldorf
> | | Universitätsstr. 1 / Raum 25.41.O2.25-2
> | | 40225 Düsseldorf / Germany
> | |
> | | Tel: +49-211-811-3911
> | | Fax: +49-211-811-2539
> ---------------------------------------------------------
>
> Important Note: This e-mail may contain trade secrets or privileged,
> undisclosed or otherwise confidential information. If you have received this
> e-mail in error, you are hereby notified that any review, copying or
> distribution of it is strictly prohibited. Please inform us immediately and
> destroy the original transmittal. Thank you for your cooperation.
>
>
>> -----Original Message-----
>> From: mauiusers-bounces at supercluster.org
>> [mailto:mauiusers-bounces at supercluster.org] On Behalf Of Jason Williams
>> Sent: Wednesday, 9 November 2011 15:26
>> To: mauiusers at supercluster.org
>> Subject: Re: [Mauiusers] Possible Memory Corruption in maui
>>
>> Dr. Stephan Raub and Joerg Blank,
>>
>> It looks like the binaries you all have are stripped of their debug
>> symbols which is going to make my idea of tracing the crash in the maui
>> code next to impossible.  However, I'm not entirely convinced this is a
>> maui bug as the last calls before the libc calls are ones through the
>> torque library.
>>
>> Joerg:  What version of Torque do you have?
>>
>> I think my next step here would be one or more of the following:
>> a) if you installed maui via RPM, load the matching -debuginfo rpm and
>> run with -d again to hopefully get some debug info in the backtrace;
>> b) run maui -d through gdb and see if you can get some useful
>> information there; c) if you compiled it from source, disable stripping
>> of the debug symbols and recompile to get more information in the
>> backtrace.  Without some useful information as to where in the maui
>> binary things are when the crash happens, I can't start looking to see
>> what happened.
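As a rough illustration of why the unresolved frames are the sticking point, here is a minimal Python sketch that splits glibc backtrace lines of the form object(symbol+offset)[address] into their parts. The names `parse_frame` and `unresolved_frames` and the regular expression are assumptions made for this example; nothing here comes from maui or torque. The sample lines are taken from the backtrace posted in this thread.

```python
import re

# One glibc backtrace frame: object path, optional "(symbol+0xoff)", "[0xaddr]".
FRAME_RE = re.compile(
    r"^(?P<obj>[^\s(\[]+)"
    r"(?:\((?P<sym>[^+)]*)(?:\+(?P<off>0x[0-9a-fA-F]+))?\))?"
    r"\[(?P<addr>0x[0-9a-fA-F]+)\]$"
)


def parse_frame(line):
    """Return a dict describing one frame, or None if the line is not a frame."""
    m = FRAME_RE.match(line.strip())
    if m is None:
        return None
    return {
        "object": m.group("obj"),
        "symbol": m.group("sym") or None,
        "offset": int(m.group("off"), 16) if m.group("off") else None,
        "address": int(m.group("addr"), 16),
    }


def unresolved_frames(lines):
    """Return the frames that carry an address but no symbol name."""
    frames = [f for f in map(parse_frame, lines) if f is not None]
    return [f for f in frames if f["symbol"] is None]


sample = [
    "/lib64/libc.so.6(__libc_malloc+0x6e)[0x3300674cde]",
    "/usr/local/torque/lib/libtorque.so.2(PBSD_rdrpy+0x80)[0x2ab278cb56d0]",
    "/usr/local/maui/sbin/maui[0x4d9e59]",
]

for frame in unresolved_frames(sample):
    # These are the addresses that need an unstripped binary to be resolved.
    print(frame["object"], hex(frame["address"]))
```

Frames that come back without a symbol are the ones where a binary with debug symbols (or the -debuginfo package) is needed before a tool such as addr2line or gdb can name the function.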
>>
>> --
>> Jason
>>
>> On 11/9/2011 4:38 AM, Dr. Stephan Raub wrote:
>>> Dear Jason Williams,
>>>
>>> thank you for your hint. Please find below the output of Maui
>>> run with the "-d" command line option (maui ran for about 5
>>> minutes before it crashed):
>>>
>>> # /usr/local/maui/sbin/maui -d
>>> *** glibc detected *** /usr/local/maui/sbin/maui: malloc(): memory corruption: 0x00000000099243e0 ***
>>> ======= Backtrace: =========
>>> /lib64/libc.so.6[0x3300672fae]
>>> /lib64/libc.so.6(__libc_malloc+0x6e)[0x3300674cde]
>>> /usr/local/torque/lib/libtorque.so.2(decode_DIS_replyCmd+0x266)[0x2ab278cb18e6]
>>> /usr/local/torque/lib/libtorque.so.2(PBSD_rdrpy+0x80)[0x2ab278cb56d0]
>>> /usr/local/torque/lib/libtorque.so.2(PBSD_status_get+0x26)[0x2ab278cb6786]
>>> /usr/local/maui/sbin/maui[0x4d9e59]
>>> /usr/local/maui/sbin/maui[0x48b8e4]
>>> /usr/local/maui/sbin/maui[0x48b84f]
>>> /usr/local/maui/sbin/maui[0x4ce81c]
>>> /usr/local/maui/sbin/maui[0x4ce39e]
>>> /usr/local/maui/sbin/maui[0x4419eb]
>>> /usr/local/maui/sbin/maui[0x403608]
>>> /lib64/libc.so.6(__libc_start_main+0xf4)[0x330061d994]
>>> /usr/local/maui/sbin/maui[0x402cd9]
>>> ======= Memory map: ========
>>> 00400000-0054f000 r-xp 00000000 08:03 50266128 /usr/local/maui/sbin/maui
>>> 0074f000-00754000 rw-p 0014f000 08:03 50266128 /usr/local/maui/sbin/maui
>>> 00754000-02344000 rw-p 00754000 00:00 0
>>> 0984b000-188f1000 rw-p 0984b000 00:00 0 [heap]
>>> 3300200000-330021c000 r-xp 00000000 08:03 18186265 /lib64/ld-2.5.so
>>> 330041b000-330041c000 r--p 0001b000 08:03 18186265 /lib64/ld-2.5.so
>>> 330041c000-330041d000 rw-p 0001c000 08:03 18186265 /lib64/ld-2.5.so
>>> 3300600000-330074e000 r-xp 00000000 08:03 18186304 /lib64/libc-2.5.so
>>> 330074e000-330094d000 ---p 0014e000 08:03 18186304 /lib64/libc-2.5.so
>>> 330094d000-3300951000 r--p 0014d000 08:03 18186304 /lib64/libc-2.5.so
>>> 3300951000-3300952000 rw-p 00151000 08:03 18186304 /lib64/libc-2.5.so
>>> 3300952000-3300957000 rw-p 3300952000 00:00 0
>>> 3300a00000-3300a02000 r-xp 00000000 08:03 18186457 /lib64/libdl-2.5.so
>>> 3300a02000-3300c02000 ---p 00002000 08:03 18186457 /lib64/libdl-2.5.so
>>> 3300c02000-3300c03000 r--p 00002000 08:03 18186457 /lib64/libdl-2.5.so
>>> 3300c03000-3300c04000 rw-p 00003000 08:03 18186457 /lib64/libdl-2.5.so
>>> 3300e00000-3300e82000 r-xp 00000000 08:03 18186543 /lib64/libm-2.5.so
>>> 3300e82000-3301081000 ---p 00082000 08:03 18186543 /lib64/libm-2.5.so
>>> 3301081000-3301082000 r--p 00081000 08:03 18186543 /lib64/libm-2.5.so
>>> 3301082000-3301083000 rw-p 00082000 08:03 18186543 /lib64/libm-2.5.so
>>> 3303a00000-3303a0d000 r-xp 00000000 08:03 18186545 /lib64/libgcc_s-4.1.2-20080825.so.1
>>> 3303a0d000-3303c0d000 ---p 0000d000 08:03 18186545 /lib64/libgcc_s-4.1.2-20080825.so.1
>>> 3303c0d000-3303c0e000 rw-p 0000d000 08:03 18186545 /lib64/libgcc_s-4.1.2-20080825.so.1
>>> 3304a00000-3304a15000 r-xp 00000000 08:03 18186491 /lib64/libselinux.so.1
>>> 3304a15000-3304c15000 ---p 00015000 08:03 18186491 /lib64/libselinux.so.1
>>> 3304c15000-3304c17000 rw-p 00015000 08:03 18186491 /lib64/libselinux.so.1
>>> 3304c17000-3304c18000 rw-p 3304c17000 00:00 0
>>> 3304e00000-3304e3b000 r-xp 00000000 08:03 18186479 /lib64/libsepol.so.1
>>> 3304e3b000-330503b000 ---p 0003b000 08:03 18186479 /lib64/libsepol.so.1
>>> 330503b000-330503c000 rw-p 0003b000 08:03 18186479 /lib64/libsepol.so.1
>>> 330503c000-3305046000 rw-p 330503c000 00:00 0
>>> 3305e00000-3305e02000 r-xp 00000000 08:03 18186469 /lib64/libkeyutils-1.3.so
>>> 3305e02000-3306001000 ---p 00002000 08:03 18186469 /lib64/libkeyutils-1.3.so
>>> 3306001000-3306002000 rw-p 00001000 08:03 18186469 /lib64/libkeyutils-1.3.so
>>> 3306200000-3306211000 r-xp 00000000 08:03 18186474 /lib64/libresolv-2.5.so
>>> 3306211000-3306411000 ---p 00011000 08:03 18186474 /lib64/libresolv-2.5.so
>>> 3306411000-3306412000 r--p 00Aborted
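To relate the raw addresses in the backtrace to the memory map, a lookup along the following lines can help. This is only a sketch: `parse_maps` and `find_mapping` are hypothetical helper names, and the input is assumed to hold one mapping per line in the usual /proc/<pid>/maps layout (start-end, permissions, offset, device, inode, optional path). The three sample mappings and the two addresses are copied from the output above.

```python
def parse_maps(text):
    """Parse maps-style lines into (start, end, perms, path) tuples."""
    regions = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5 or "-" not in fields[0]:
            continue  # skip truncated or unrelated lines
        start_s, end_s = fields[0].split("-", 1)
        try:
            start, end = int(start_s, 16), int(end_s, 16)
        except ValueError:
            continue  # not a hexadecimal address range
        path = fields[5] if len(fields) > 5 else ""
        regions.append((start, end, fields[1], path))
    return regions


def find_mapping(regions, address):
    """Return (perms, path) of the region containing address, else None."""
    for start, end, perms, path in regions:
        if start <= address < end:
            return perms, path
    return None


MAPS = """\
00400000-0054f000 r-xp 00000000 08:03 50266128 /usr/local/maui/sbin/maui
0984b000-188f1000 rw-p 0984b000 00:00 0 [heap]
3300600000-330074e000 r-xp 00000000 08:03 18186304 /lib64/libc-2.5.so
"""

regions = parse_maps(MAPS)
print(find_mapping(regions, 0x4D9E59))   # first maui frame in the backtrace
print(find_mapping(regions, 0x99243E0))  # address named in the glibc message
```

With these inputs the maui frame falls inside the executable text mapping of the maui binary, while the address glibc complains about lies in the heap region.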
>>>
>>> Thank you for your efforts.
>>>
>>> Stephan
>>>> -----Original Message-----
>>>> From: mauiusers-bounces at supercluster.org
>>>> [mailto:mauiusers-bounces at supercluster.org] On Behalf Of Jason Williams
>>>> Sent: Tuesday, 8 November 2011 23:50
>>>> To: mauiusers at supercluster.org
>>>> Subject: Re: [Mauiusers] Possible Memory Corruption in maui
>>>>
>>>> Dr Stephan Raub,
>>>>
>>>> Maui does have some very odd "memory management" in it that has a
>>>> tendency to cause these types of crashes when run in high volume
>>>> situations without some tweaks and/or concessions.  I've tracked
>>>> down, and I think fixed, one in the latest svn trunk, but 3.3.1
>>>> should already have that fix in it.
>>>>
>>>> Have you tried running maui from the command line with the -d
>>>> option and capturing the memory corruption message and backtrace
>>>> that come out of it?  Your original email has the strace, but it
>>>> cuts off some of the backtrace.  I might be able to see where in the
>>>> code it's having problems if I can get the full backtrace.
>>>>
>>>>
>>>> --
>>>> Jason Williams
>>>> Systems Engineer
>>>> Homewood High Performance Cluster
>>>> Johns Hopkins University
>>>>
>>>> On 11/8/2011 12:09 PM, Dr. Stephan Raub wrote:
>>>>> Dear Mr. van der Vlies
>>>>>
>>>>> Currently we have 6095 jobs queued and 93 jobs running. Among
>>>>> these, we have some large job arrays (1000 and 4000 items per
>>>>> array).
>>>>> Best regards.
>>>>>> -----Original Message-----
>>>>>> From: Bas van der Vlies [mailto:basv at sara.nl]
>>>>>> Sent: Tuesday, 8 November 2011 17:10
>>>>>> To: Dr. Stephan Raub
>>>>>> Subject: Re: [Mauiusers] Possible Memory Corruption in maui
>>>>>>
>>>>>> On 08-11-11 16:40, Dr. Stephan Raub wrote:
>>>>>>> Dear fellow maui users,
>>>>>>>
>>>>>>> we are running Maui 3.3.1 with torque 2.3.7 under RHEL5.5
>>>>>>> (2.6.8-194.26.1.el1) on a cluster with roughly 600 cores.
>>>>>>>
>>>>>>> We experienced a sudden death of the maui scheduler with no
>>>>>>> message in the logs. We could not figure out the reason, so we
>>>>>>> attached "strace" to the maui process (while it was still alive)
>>>>>>> and got:
>>>>>>>
>>>>>> Dear Dr. Stephan Raub,
>>>>>>
>>>>>> just a question: How many jobs are in the queue?
>>>>>>
>>>>>> regards
>>>>>>
>>>>>>
>>>>>> --
>>>>>> ********************************************************************
>>>>>> *  Bas van der Vlies                    e-mail: basv at sara.nl    *
>>>>>> *  SARA - Academic Computing Services   Amsterdam, The Netherlands *
>>>>>> ********************************************************************
>>>>> _______________________________________________
>>>>> mauiusers mailing list
>>>>> mauiusers at supercluster.org
>>>>> http://www.supercluster.org/mailman/listinfo/mauiusers