[Mauiusers] maui hangs/segfaults in 3.3.1

Paul Raines raines at nmr.mgh.harvard.edu
Wed Jul 18 07:07:09 MDT 2012


I tried putting a watch on MSched.statfp to see if I could catch it getting 
corrupted, but I just ended up with a segfault in a different location, this 
time in the fprintf immediately before the fflush it segfaulted in last time, 
as you can see in the backtrace below.

So I went in and commented out all the CLASSCFG lines in my maui.cfg and 
restarted.  So far maui has been running longer than it ever has before 
without hanging or crashing.  However, the whole reason for the CLASSCFG lines 
was that maui seemed in the past to be ignoring the max_user_run setting for 
each of my queues.  I will need to monitor things to see whether that is 
still the case.

One related question.  What I really want to limit on a per-queue basis is
not the number of jobs but the number of CPUs a user has running.  Is there
any way to do that?
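(For reference: if the MAXPROCPERUSER attribute used in the maui.cfg quoted
below behaves as its name suggests, it is a per-class cap on processors rather
than on job count, e.g.:

```
# caps one user at 60 processors in the matlab class, not 60 jobs
CLASSCFG[matlab] MAXPROCPERUSER=60
```

so the CLASSCFG lines that were just commented out may already have been the
mechanism for this.)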

-- Paul Raines (http://help.nmr.mgh.harvard.edu)



On Tue, 17 Jul 2012 4:03pm, Steve Johnson wrote:

> On 07/17/2012 02:05 PM, Paul Raines wrote:
>> No, I know nothing about that.  I think I can remove most of those CLASSCFG
>> lines, as I was having problems getting max_user_run to actually work in a
>> previous torque version.  Or will the mere fact that I have more than 16
>> queues defined in torque still be a problem?
>> 
>> Seems like maui should then give an error at startup saying there are too
>> many CLASSCFG entries in the config if MAX_CLASS is exceeded.
>
> IIRC, maui will ignore any classes > 16, so it probably isn't clobbering 
> memory elsewhere.  But if you notice queues not getting scheduled, that limit 
> will be the problem unless you have a CLASSCFG[DEFAULT] defined.
>
>> Where is this documented?  What is the difference between MAX_MCLASS
>> (default 64) and MAX_CLASS (default 16)?
>
> Documented? Heh...good one. ;)
>
> It looks like MMAX_CLASS is used in src/moab/Mutil.c and src/mcom/MS3I.c, 
> whereas MAX_MCLASS is more widely used throughout the code.  Not sure if 
> they're directly related.
>
> You might check whether a particular job is triggering the segfault/hang and
> see if there's anything abnormal in its characteristics in Torque (uid, gid,
> super long or "strange" strings/paths, etc.).  Try setting a breakpoint in
> MJobWriteStats and examining variables.  If you find a bogus address, work
> backward to see where it got clobbered.  Sorry I can't offer more help.
>
> I had a crashing problem a couple of weeks ago, but it appears to be
> unrelated.  I followed the same path as you with gdb and also inserted some
> conditional printfs in the source, finally tracking it down to MMAX_JOBRA
> being set too low.  Sadly, the process took several hours.  Why such limits
> are hardcoded is beyond me.
>
> // Steve
>
>
>> 
>> Thanks
>> 
>> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>> 
>> 
>> 
>> On Tue, 17 Jul 2012 2:56pm, Steve Johnson wrote:
>> 
>>> It looks like you have 17 CLASSCFG lines.  Have you increased MAX_MCLASS
>>> and MMAX_CLASS in include/msched-common.h?
>>> 
>>> // Steve
>>> 
>>> 
>>> On 07/17/2012 12:42 PM, Paul Raines wrote:
>>>> 
>>>> We have two separate clusters.  One is an ancient cluster with nodes that
>>>> are dual Opterons with 4G RAM.  The other is newer, with dual quad-core
>>>> Xeon E5472s and 32G RAM.  Recently we updated both clusters to CentOS6,
>>>> torque-2.5.11, and maui 3.3.1, so OS-, software-, and config-wise they are
>>>> identical.  I built the torque/maui RPMs myself on an old Opteron node to
>>>> install on both clusters.
>>>> 
>>>> The older cluster has been running without any problems.  On the new one,
>>>> though, maui keeps hanging or segfaulting within 1-8 hours of starting.
>>>> I installed the debuginfo RPMs and ran maui in the debugger.
>>>> 
>>>> When it just hangs (it doesn't crash, but doesn't respond to any tools
>>>> such as showq), this is what I see:
>>>> 
>>>> =========================================================================
>>>> (gdb) run -d
>>>> Starting program: /usr/sbin/maui -d
>>>> *** glibc detected *** /usr/sbin/maui: corrupted double-linked list:
>>>> 0x0000000007f106a0 ***
>>>> 
>>>> 
>>>> ^C
>>>> Program received signal SIGINT, Interrupt.
>>>> 0x00000036cd2f542e in __lll_lock_wait_private () from /lib64/libc.so.6
>>>> (gdb) bt
>>>> #0  0x00000036cd2f542e in __lll_lock_wait_private () from /lib64/libc.so.6
>>>> #1  0x00000036cd27bed5 in _L_lock_9323 () from /lib64/libc.so.6
>>>> #2  0x00000036cd2797c6 in malloc () from /lib64/libc.so.6
>>>> #3  0x00000036cca04c72 in local_strdup () from /lib64/ld-linux-x86-64.so.2
>>>> #4  0x00000036cca08636 in _dl_map_object () from /lib64/ld-linux-x86-64.so.2
>>>> #5  0x00000036cca12994 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
>>>> #6  0x00000036cca0e176 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
>>>> #7  0x00000036cca1244a in _dl_open () from /lib64/ld-linux-x86-64.so.2
>>>> #8  0x00000036cd323520 in do_dlopen () from /lib64/libc.so.6
>>>> #9  0x00000036cca0e176 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
>>>> #10 0x00000036cd323677 in __libc_dlopen_mode () from /lib64/libc.so.6
>>>> #11 0x00000036cd2fbd51 in backtrace () from /lib64/libc.so.6
>>>> #12 0x00000036cd26f98b in __libc_message () from /lib64/libc.so.6
>>>> #13 0x00000036cd275296 in malloc_printerr () from /lib64/libc.so.6
>>>> #14 0x00000036cd277efa in _int_free () from /lib64/libc.so.6
>>>> #15 0x0000000000466136 in MUFree (Ptr=0x46bfbd0) at MUtil.c:460
>>>> #16 0x00000000004499a5 in MUserDestroy (UP=0x46bfbd0) at MUser.c:682
>>>> #17 0x00000000004499de in MUserFreeTable () at MUser.c:700
>>>> #18 0x00000000004ac48f in MSysShutdown (Signo=0) at MSys.c:2540
>>>> #19 0x0000000000418361 in UIProcessClients (SS=0x774d270,
>>>>       TimeLimit=<value optimized out>) at UserI.c:527
>>>> #20 0x0000000000405bb8 in main (ArgC=2, ArgV=<value optimized out>)
>>>>       at Server.c:240
>>>> (gdb) quit
>>>> =========================================================================
>>>> 
>>>> 
>>>> When it crashes, this is what I see:
>>>> 
>>>> =========================================================================
>>>> (gdb) run -d
>>>> Starting program: /usr/sbin/maui -d
>>>> 
>>>> 
>>>> Program received signal SIGSEGV, Segmentation fault.
>>>> 0x00000036cd265ee7 in _IO_fflush (fp=0x7f0d010) at iofflush.c:43
>>>> 43            result = _IO_SYNC (fp) ? EOF : 0;
>>>> (gdb)
>>>> (gdb) bt
>>>> #0  0x00000036cd265ee7 in _IO_fflush (fp=0x7f0d010) at iofflush.c:43
>>>> #1  0x000000000047c07b in MJobWriteStats (J=0x9b61080) at MJob.c:7815
>>>> #2  0x000000000048643e in MJobProcessCompleted (J=0x9b61080) at MJob.c:9562
>>>> #3  0x00000000004a6eb8 in MPBSWorkloadQuery (R=0x6a4b2e0,
>>>>       JCount=0x7ffffff7b938, SC=<value optimized out>) at MPBSI.c:871
>>>> #4  0x000000000045f926 in __MUTFunc (V=0x7ffffff7b830) at MUtil.c:4718
>>>> #5  0x0000000000462387 in MUThread (F=<value optimized out>,
>>>>       TimeOut=<value optimized out>, RC=<value optimized out>,
>>>>       ACount=<value optimized out>, Lock=<value optimized out>) at MUtil.c:4691
>>>> #6  0x0000000000498ed4 in MRMWorkloadQuery (WCount=0x7ffffff7b98c, SC=0x0)
>>>>       at MRM.c:595
>>>> #7  0x000000000049cb19 in MRMGetInfo () at MRM.c:364
>>>> #8  0x000000000042dc42 in MSchedProcessJobs (OldDay=0x7fffffffde40 "Tue",
>>>>       GlobalSQ=0x7ffffffdbe30, GlobalHQ=0x7ffffffbbe30) at MSched.c:6930
>>>> #9  0x0000000000405c46 in main (ArgC=2, ArgV=<value optimized out>)
>>>>       at Server.c:192
>>>> (gdb) frame
>>>> #0  0x00000036cd265ee7 in _IO_fflush (fp=0x7f0d010) at iofflush.c:43
>>>> 43            result = _IO_SYNC (fp) ? EOF : 0;
>>>> (gdb) frame 1
>>>> #1  0x000000000047c07b in MJobWriteStats (J=0x9b61080) at MJob.c:7815
>>>> 7815        fflush(MSched.statfp);
>>>> (gdb) list MJob.c:7815
>>>> 7810
>>>> 7811      if (MJobToTString(J,DEFAULT_WORKLOAD_TRACE_VERSION,Buf,sizeof(Buf)) == SUCCESS)
>>>> 7812        {
>>>> 7813        fprintf(MSched.statfp,"%s",Buf);
>>>> 7814
>>>> 7815        fflush(MSched.statfp);
>>>> 7816
>>>> 7817        DBG(4,fSTAT) DPrint("INFO:     job stats written for '%s'\n",
>>>> 7818          J->Name);
>>>> 7819        }
>>>> (gdb) p Buf
>>>> $3 = "16828", ' ' <repeats 18 times>, "0   1    coutu     coutu  345600
>>>> Completed  [max100:1] 1342534818 1342534819 1342534819 1342535999 
>>>> [NONE]
>>>> [NONE] [NONE] >=    0M >=      0M   [nonGPU] 1342534818   1    1
>>>> [NONE]:DEFA"...
>>>> (gdb)
>>>> =========================================================================
>>>> 
>>>> My guess is that some memory corruption has overwritten MSched.statfp,
>>>> which is just a file handle, and thus fflush crashes when it actually
>>>> tries to write to it.  Where that overwrite is occurring, though, is
>>>> anyone's guess.
>>>> 
>>>> I am hoping someone on this list might have a clue.  It is really a
>>>> mystery to me why I only see this on one cluster.  They have exactly the
>>>> same config except for the host name.  Here is my maui.cfg:
>>>> 
>>>> =========================================================================
>>>> ADMIN1                maui root
>>>> ADMIN3                ALL
>>>> ADMINHOST               launchpad.nmr.mgh.harvard.edu
>>>> BACKFILLPOLICY        FIRSTFIT
>>>> CLASSCFG[default] MAXPROCPERUSER=150
>>>> CLASSCFG[extended] MAXPROCPERUSER=50 MAXPROC=250
>>>> CLASSCFG[GPU] MAXPROCPERUSER=5000
>>>> CLASSCFG[matlab] MAXPROCPERUSER=60
>>>> CLASSCFG[max100] MAXPROCPERUSER=100
>>>> CLASSCFG[max10] MAXPROCPERUSER=10
>>>> CLASSCFG[max200] MAXPROCPERUSER=200
>>>> CLASSCFG[max20] MAXPROCPERUSER=20
>>>> CLASSCFG[max50] MAXPROCPERUSER=50
>>>> CLASSCFG[max75] MAXPROCPERUSER=75
>>>> CLASSCFG[p10] MAXPROCPERUSER=5000
>>>> CLASSCFG[p20] MAXPROCPERUSER=5000
>>>> CLASSCFG[p30] MAXPROCPERUSER=5000
>>>> CLASSCFG[p40] MAXPROCPERUSER=5000
>>>> CLASSCFG[p50] MAXPROCPERUSER=30
>>>> CLASSCFG[p5] MAXPROCPERUSER=5000
>>>> CLASSCFG[p60] MAXPROCPERUSER=20
>>>> CLASSWEIGHT           10
>>>> ENABLEMULTIREQJOBS TRUE
>>>> ENFORCERESOURCELIMITS   OFF
>>>> LOGFILEMAXSIZE        1000000000
>>>> LOGFILE               /var/spool/maui/log/maui.log
>>>> LOGLEVEL              2
>>>> NODEALLOCATIONPOLICY  PRIORITY
>>>> NODECFG[DEFAULT] PRIORITY=1000 PRIORITYF='PRIORITY + 3 * JOBCOUNT'
>>>> QUEUETIMEWEIGHT       1
>>>> RESERVATIONPOLICY     CURRENTHIGHEST
>>>> RMCFG[base]             TYPE=PBS
>>>> RMPOLLINTERVAL          00:00:30
>>>> SERVERHOST              launchpad.nmr.mgh.harvard.edu
>>>> SERVERMODE              NORMAL
>>>> SERVERPORT              40559
>>>> USERCFG[DEFAULT] MAXIPROC=8
>>>> USERCFG[jonghwan] MAXPROC=300
>>>> USERCFG[shafee] MAXPROC=300
>>>> 
>>>> I actually changed the LOGLEVEL from 3 to 2 at one point, thinking the
>>>> error was happening while writing to the log and that lowering the amount
>>>> it writes might affect things, but it didn't help.
>>>> 
>>>> ---------------------------------------------------------------
>>>> Paul Raines                     http://help.nmr.mgh.harvard.edu
>>>> MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
>>>> 149 (2301) 13th Street     Charlestown, MA 02129        USA
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> mauiusers mailing list
>>>> mauiusers at supercluster.org
>>>> http://www.supercluster.org/mailman/listinfo/mauiusers
>>>> 
>>> 
>>> 
>>> 
>
>
>

