[Mauiusers] maui hangs/segfaults in 3.3.1

Steve Johnson steve at isc.tamu.edu
Tue Jul 17 14:03:19 MDT 2012


On 07/17/2012 02:05 PM, Paul Raines wrote:
> No, I know nothing about that.  I think I can remove most of those CLASSCFG
> lines, as I was having problems in a previous torque release getting
> max_user_run to actually work.  Or will just the fact that I have more than
> 16 queues defined in torque still be a problem?
>
> Seems like maui should then give an error at startup saying there are too
> many CLASSCFG lines in the config if MMAX_CLASS is exceeded.

IIRC, maui will ignore any classes > 16, so it probably isn't clobbering 
memory elsewhere.  But if you notice queues not getting scheduled, that limit 
will be the problem unless you have a CLASSCFG[DEFAULT] defined.
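
And an explicit startup error like you suggest would be a small patch.
Purely an untested sketch of the idea, with made-up names (only MMAX_CLASS
and its default of 16 come from this thread):

  #include <stdio.h>

  #define MMAX_CLASS 16  /* compiled-in class table size */

  /* Hypothetical check: complain at startup instead of silently
     ignoring CLASSCFG entries beyond the compiled-in limit. */

  static void MClassCheckCount(int ConfiguredCount)
    {
    if (ConfiguredCount > MMAX_CLASS)
      {
      fprintf(stderr,
        "ERROR:    %d CLASSCFG entries exceed MMAX_CLASS (%d); extras are ignored\n",
        ConfiguredCount,
        MMAX_CLASS);
      }
    }

  int main(void)
    {
    MClassCheckCount(17);  /* e.g. the 17 CLASSCFG lines in your config */

    return 0;
    }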

> Where is this documented?  What is the difference between MAX_MCLASS (default
> 64) and MMAX_CLASS (default 16)?

Documented? Heh...good one. ;)

It looks like MMAX_CLASS is used in src/moab/MUtil.c and src/mcom/MS3I.c, 
whereas MAX_MCLASS is more widely used throughout the code.  Not sure if 
they're directly related.
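
If you do end up needing more than 16 classes, the place to raise the limits
is include/msched-common.h (as in my earlier mail), followed by a rebuild and
reinstall of maui.  The new values below are only examples:

  /* include/msched-common.h -- example bump of the compiled-in limits.
     Defaults per this thread: MAX_MCLASS 64, MMAX_CLASS 16. */

  #define MAX_MCLASS  128
  #define MMAX_CLASS   32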

You might check if there's a particular job that's triggering the 
segfault/hang and see if there's anything abnormal in its characteristics in 
Torque (uid, gid, super long or "strange" strings/paths, etc).  Try setting a 
breakpoint in MJobWriteStats and examining variables.  If you find a bogus 
address, work backward to see where it got clobbered.  Sorry I can't offer 
more help.
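
Actually, one more idea for the "work backward" part: since MSched is a
global, you can set a watchpoint on the pointer itself, and gdb will stop at
the exact write that clobbers it (then bt shows the culprit).  An untested
sketch of such a session:

  (gdb) break MJobWriteStats
  (gdb) run -d
  ...
  Breakpoint 1, MJobWriteStats (J=...) at MJob.c:...
  (gdb) p MSched.statfp
  $1 = (FILE *) 0x...
  (gdb) watch MSched.statfp
  Hardware watchpoint 2: MSched.statfp
  (gdb) continue

If the pointer already looks bogus at the first breakpoint hit, start the
watchpoint earlier (e.g. from a breakpoint on main).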

I had a crashing problem a couple of weeks ago, but it appears to be
unrelated.  I followed the same path as you with gdb, and also inserted some
conditional printf's in the source, to finally track it down to MMAX_JOBRA
being set too low.  Sadly, the process took several hours.  Why such limits
are hardcoded is beyond me.
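
For what it's worth, the "conditional printf" trick is nothing fancier than
caching the last-seen value of the suspect variable and logging only when it
changes, so the output stays small enough to read.  A generic sketch with
made-up names (not maui code):

  #include <stdio.h>

  /* Log only when the suspect pointer changes, rather than on every
     iteration, so the output narrows down where the clobber happens. */

  static FILE *OldStatFP = NULL;

  static void TraceStatFP(FILE *StatFP, const char *Where)
    {
    if (StatFP != OldStatFP)
      {
      fprintf(stderr, "INFO:     statfp changed to %p at %s\n",
        (void *)StatFP,
        Where);

      OldStatFP = StatFP;
      }
    }

  int main(void)
    {
    TraceStatFP(stderr, "startup");          /* first call: logs */
    TraceStatFP(stderr, "end of iteration"); /* unchanged: stays quiet */

    return 0;
    }

In your case, calling something like that with MSched.statfp at the top of
each phase of MSchedProcessJobs would at least tell you which phase the
clobber happens in.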

// Steve


>
> Thanks
>
> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>
>
>
> On Tue, 17 Jul 2012 2:56pm, Steve Johnson wrote:
>
>> It looks like you have 17 CLASSCFG lines.  Have you increased MAX_MCLASS and
>> MMAX_CLASS in include/msched-common.h?
>>
>> // Steve
>>
>>
>> On 07/17/2012 12:42 PM, Paul Raines wrote:
>>>
>>> We have two separate clusters. One is an ancient cluster with nodes that are
>>> dual Opterons and 4G RAM.  The other is newer with dual quad Xeon E5472's and
>>> 32G RAM.  Recently we updated both clusters to CentOS6, torque-2.5.11 and
>>> maui 3.3.1.  So OS/software/config-wise they are identical.  I built
>>> torque/maui RPMs myself on an old Opteron node to install on both clusters.
>>>
>>> The older cluster has been running without any problems.  On the new one,
>>> though, maui keeps hanging or segfaulting within 1-8 hours of starting.
>>> I installed the debuginfo RPMs and ran maui in the debugger.
>>>
>>> When it just hangs (doesn't crash but doesn't respond to any tools such
>>> as showq), this is what I see:
>>>
>>> =========================================================================
>>> (gdb) run -d
>>> Starting program: /usr/sbin/maui -d
>>> *** glibc detected *** /usr/sbin/maui: corrupted double-linked list:
>>> 0x0000000007f106a0 ***
>>>
>>>
>>> ^C
>>> Program received signal SIGINT, Interrupt.
>>> 0x00000036cd2f542e in __lll_lock_wait_private () from /lib64/libc.so.6
>>> (gdb) bt
>>> #0  0x00000036cd2f542e in __lll_lock_wait_private () from /lib64/libc.so.6
>>> #1  0x00000036cd27bed5 in _L_lock_9323 () from /lib64/libc.so.6
>>> #2  0x00000036cd2797c6 in malloc () from /lib64/libc.so.6
>>> #3  0x00000036cca04c72 in local_strdup () from /lib64/ld-linux-x86-64.so.2
>>> #4  0x00000036cca08636 in _dl_map_object () from /lib64/ld-linux-x86-64.so.2
>>> #5  0x00000036cca12994 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
>>> #6  0x00000036cca0e176 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
>>> #7  0x00000036cca1244a in _dl_open () from /lib64/ld-linux-x86-64.so.2
>>> #8  0x00000036cd323520 in do_dlopen () from /lib64/libc.so.6
>>> #9  0x00000036cca0e176 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
>>> #10 0x00000036cd323677 in __libc_dlopen_mode () from /lib64/libc.so.6
>>> #11 0x00000036cd2fbd51 in backtrace () from /lib64/libc.so.6
>>> #12 0x00000036cd26f98b in __libc_message () from /lib64/libc.so.6
>>> #13 0x00000036cd275296 in malloc_printerr () from /lib64/libc.so.6
>>> #14 0x00000036cd277efa in _int_free () from /lib64/libc.so.6
>>> #15 0x0000000000466136 in MUFree (Ptr=0x46bfbd0) at MUtil.c:460
>>> #16 0x00000000004499a5 in MUserDestroy (UP=0x46bfbd0) at MUser.c:682
>>> #17 0x00000000004499de in MUserFreeTable () at MUser.c:700
>>> #18 0x00000000004ac48f in MSysShutdown (Signo=0) at MSys.c:2540
>>> #19 0x0000000000418361 in UIProcessClients (SS=0x774d270,
>>>       TimeLimit=<value optimized out>) at UserI.c:527
>>> #20 0x0000000000405bb8 in main (ArgC=2, ArgV=<value optimized out>)
>>>       at Server.c:240
>>> (gdb) quit
>>> =========================================================================
>>>
>>>
>>> When it crashes, this is what I see:
>>>
>>> =========================================================================
>>> (gdb) run -d
>>> Starting program: /usr/sbin/maui -d
>>>
>>>
>>> Program received signal SIGSEGV, Segmentation fault.
>>> 0x00000036cd265ee7 in _IO_fflush (fp=0x7f0d010) at iofflush.c:43
>>> 43            result = _IO_SYNC (fp) ? EOF : 0;
>>> (gdb)
>>> (gdb) bt
>>> #0  0x00000036cd265ee7 in _IO_fflush (fp=0x7f0d010) at iofflush.c:43
>>> #1  0x000000000047c07b in MJobWriteStats (J=0x9b61080) at MJob.c:7815
>>> #2  0x000000000048643e in MJobProcessCompleted (J=0x9b61080) at MJob.c:9562
>>> #3  0x00000000004a6eb8 in MPBSWorkloadQuery (R=0x6a4b2e0,
>>>       JCount=0x7ffffff7b938, SC=<value optimized out>) at MPBSI.c:871
>>> #4  0x000000000045f926 in __MUTFunc (V=0x7ffffff7b830) at MUtil.c:4718
>>> #5  0x0000000000462387 in MUThread (F=<value optimized out>,
>>>       TimeOut=<value optimized out>, RC=<value optimized out>,
>>>       ACount=<value optimized out>, Lock=<value optimized out>) at MUtil.c:4691
>>> #6  0x0000000000498ed4 in MRMWorkloadQuery (WCount=0x7ffffff7b98c, SC=0x0)
>>>       at MRM.c:595
>>> #7  0x000000000049cb19 in MRMGetInfo () at MRM.c:364
>>> #8  0x000000000042dc42 in MSchedProcessJobs (OldDay=0x7fffffffde40 "Tue",
>>>       GlobalSQ=0x7ffffffdbe30, GlobalHQ=0x7ffffffbbe30) at MSched.c:6930
>>> #9  0x0000000000405c46 in main (ArgC=2, ArgV=<value optimized out>)
>>>       at Server.c:192
>>> (gdb) frame
>>> #0  0x00000036cd265ee7 in _IO_fflush (fp=0x7f0d010) at iofflush.c:43
>>> 43            result = _IO_SYNC (fp) ? EOF : 0;
>>> (gdb) frame 1
>>> #1  0x000000000047c07b in MJobWriteStats (J=0x9b61080) at MJob.c:7815
>>> 7815        fflush(MSched.statfp);
>>> (gdb) list MJob.c:7815
>>> 7810
>>> 7811      if (MJobToTString(J,DEFAULT_WORKLOAD_TRACE_VERSION,Buf,sizeof(Buf)) == SUCCESS)
>>> 7812        {
>>> 7813        fprintf(MSched.statfp,"%s",Buf);
>>> 7814
>>> 7815        fflush(MSched.statfp);
>>> 7816
>>> 7817        DBG(4,fSTAT) DPrint("INFO:     job stats written for '%s'\n",
>>> 7818          J->Name);
>>> 7819        }
>>> (gdb) p Buf
>>> $3 = "16828", ' ' <repeats 18 times>, "0   1    coutu     coutu  345600
>>> Completed  [max100:1] 1342534818 1342534819 1342534819 1342535999    [NONE]
>>> [NONE] [NONE] >=    0M >=      0M   [nonGPU] 1342534818   1    1
>>> [NONE]:DEFA"...
>>> (gdb)
>>> =========================================================================
>>>
>>> My guess is some memory corruption has overwritten MSched.statfp, which is
>>> just a file handle, and thus fflush crashes when it actually tries to
>>> write to it.  Where that overwrite is occurring, though, is anyone's guess.
>>>
>>> I am hoping someone on this list might have a clue.  It is really a mystery
>>> to me why I only see this on one cluster; they have exactly the same config
>>> except for the host name.  Here is my maui.cfg:
>>>
>>> =========================================================================
>>> ADMIN1                maui root
>>> ADMIN3                ALL
>>> ADMINHOST               launchpad.nmr.mgh.harvard.edu
>>> BACKFILLPOLICY        FIRSTFIT
>>> CLASSCFG[default] MAXPROCPERUSER=150
>>> CLASSCFG[extended] MAXPROCPERUSER=50 MAXPROC=250
>>> CLASSCFG[GPU] MAXPROCPERUSER=5000
>>> CLASSCFG[matlab] MAXPROCPERUSER=60
>>> CLASSCFG[max100] MAXPROCPERUSER=100
>>> CLASSCFG[max10] MAXPROCPERUSER=10
>>> CLASSCFG[max200] MAXPROCPERUSER=200
>>> CLASSCFG[max20] MAXPROCPERUSER=20
>>> CLASSCFG[max50] MAXPROCPERUSER=50
>>> CLASSCFG[max75] MAXPROCPERUSER=75
>>> CLASSCFG[p10] MAXPROCPERUSER=5000
>>> CLASSCFG[p20] MAXPROCPERUSER=5000
>>> CLASSCFG[p30] MAXPROCPERUSER=5000
>>> CLASSCFG[p40] MAXPROCPERUSER=5000
>>> CLASSCFG[p50] MAXPROCPERUSER=30
>>> CLASSCFG[p5] MAXPROCPERUSER=5000
>>> CLASSCFG[p60] MAXPROCPERUSER=20
>>> CLASSWEIGHT           10
>>> ENABLEMULTIREQJOBS TRUE
>>> ENFORCERESOURCELIMITS   OFF
>>> LOGFILEMAXSIZE        1000000000
>>> LOGFILE               /var/spool/maui/log/maui.log
>>> LOGLEVEL              2
>>> NODEALLOCATIONPOLICY  PRIORITY
>>> NODECFG[DEFAULT] PRIORITY=1000 PRIORITYF='PRIORITY + 3 * JOBCOUNT'
>>> QUEUETIMEWEIGHT       1
>>> RESERVATIONPOLICY     CURRENTHIGHEST
>>> RMCFG[base]             TYPE=PBS
>>> RMPOLLINTERVAL          00:00:30
>>> SERVERHOST              launchpad.nmr.mgh.harvard.edu
>>> SERVERMODE              NORMAL
>>> SERVERPORT              40559
>>> USERCFG[DEFAULT] MAXIPROC=8
>>> USERCFG[jonghwan] MAXPROC=300
>>> USERCFG[shafee] MAXPROC=300
>>> =========================================================================
>>>
>>> I actually changed the LOGLEVEL from 3 to 2 at one point, thinking the
>>> error was happening while writing to the log and that lowering the amount
>>> it writes might affect things, but it didn't help.
>>>
>>> ---------------------------------------------------------------
>>> Paul Raines                     http://help.nmr.mgh.harvard.edu
>>> MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
>>> 149 (2301) 13th Street     Charlestown, MA 02129        USA
>>>
>>>
>>
>>
>>


