[Mauiusers] maui hangs/segfaults in 3.3.1

Paul Raines raines at nmr.mgh.harvard.edu
Tue Jul 17 11:42:58 MDT 2012


We have two separate clusters. One is an ancient cluster with nodes that are 
dual Opterons and 4G RAM.  The other is newer with dual quad Xeon E5472's and
32G RAM.  Recently we updated both clusters to CentOS6, torque-2.5.11 and
maui 3.3.1, so in terms of OS, software and config they are identical.  I built
the torque/maui RPMs myself on an old Opteron node to install on both clusters.

The older cluster has been running without any problems.  On the new one 
though, maui keeps hanging or segfaulting within 1-8 hours of starting.
I installed the debuginfo RPMs and ran maui in the debugger.

When it just hangs (doesn't crash but doesn't respond to any tools such
as showq), this is what I see:

=========================================================================
(gdb) run -d
Starting program: /usr/sbin/maui -d
*** glibc detected *** /usr/sbin/maui: corrupted double-linked list: 0x0000000007f106a0 ***


^C
Program received signal SIGINT, Interrupt.
0x00000036cd2f542e in __lll_lock_wait_private () from /lib64/libc.so.6
(gdb) bt
#0  0x00000036cd2f542e in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00000036cd27bed5 in _L_lock_9323 () from /lib64/libc.so.6
#2  0x00000036cd2797c6 in malloc () from /lib64/libc.so.6
#3  0x00000036cca04c72 in local_strdup () from /lib64/ld-linux-x86-64.so.2
#4  0x00000036cca08636 in _dl_map_object () from /lib64/ld-linux-x86-64.so.2
#5  0x00000036cca12994 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#6  0x00000036cca0e176 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#7  0x00000036cca1244a in _dl_open () from /lib64/ld-linux-x86-64.so.2
#8  0x00000036cd323520 in do_dlopen () from /lib64/libc.so.6
#9  0x00000036cca0e176 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#10 0x00000036cd323677 in __libc_dlopen_mode () from /lib64/libc.so.6
#11 0x00000036cd2fbd51 in backtrace () from /lib64/libc.so.6
#12 0x00000036cd26f98b in __libc_message () from /lib64/libc.so.6
#13 0x00000036cd275296 in malloc_printerr () from /lib64/libc.so.6
#14 0x00000036cd277efa in _int_free () from /lib64/libc.so.6
#15 0x0000000000466136 in MUFree (Ptr=0x46bfbd0) at MUtil.c:460
#16 0x00000000004499a5 in MUserDestroy (UP=0x46bfbd0) at MUser.c:682
#17 0x00000000004499de in MUserFreeTable () at MUser.c:700
#18 0x00000000004ac48f in MSysShutdown (Signo=0) at MSys.c:2540
#19 0x0000000000418361 in UIProcessClients (SS=0x774d270,
     TimeLimit=<value optimized out>) at UserI.c:527
#20 0x0000000000405bb8 in main (ArgC=2, ArgV=<value optimized out>)
     at Server.c:240
(gdb) quit
=========================================================================
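
In the backtrace above, glibc reports the corruption at the moment MUFree
calls free() (frames #14 and #15), which only says the heap was already
damaged by then, not which write damaged it.  One common way to narrow that
down is to wrap allocations with a trailing canary and check it at free time.
The following is a minimal sketch of that idea, not Maui code: dbg_alloc,
dbg_check, dbg_free and the canary bytes are hypothetical names chosen for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical debug allocator: [header][user bytes][canary]. */
#define CANARY_LEN 8
static const unsigned char CANARY[CANARY_LEN] =
    {0xde, 0xad, 0xbe, 0xef, 0xde, 0xad, 0xbe, 0xef};

/* The union keeps the user region aligned for any object type. */
typedef union { size_t size; max_align_t align; } dbg_header;

/* Allocate size bytes and place a canary right after them. */
void *dbg_alloc(size_t size)
{
    if (size > SIZE_MAX - sizeof(dbg_header) - CANARY_LEN)
        return NULL;
    dbg_header *h = malloc(sizeof(dbg_header) + size + CANARY_LEN);
    if (h == NULL)
        return NULL;
    h->size = size;
    unsigned char *user = (unsigned char *)(h + 1);
    memcpy(user + size, CANARY, CANARY_LEN);
    return user;
}

/* Return 1 if the canary after the user region is intact, 0 otherwise. */
int dbg_check(const void *ptr)
{
    assert(ptr != NULL);
    const dbg_header *h = (const dbg_header *)ptr - 1;
    const unsigned char *user = ptr;
    return memcmp(user + h->size, CANARY, CANARY_LEN) == 0;
}

/* Free the block; return 0 if it was intact, -1 if an overrun was seen. */
int dbg_free(void *ptr)
{
    if (ptr == NULL)
        return 0;
    int ok = dbg_check(ptr);
    free((dbg_header *)ptr - 1);
    return ok ? 0 : -1;
}
```

Calling dbg_check on long-lived buffers once per scheduling iteration would
report an overrun closer to where it happens than waiting for free() to notice.
This only catches writes just past the end of a block; underruns and stray
pointer writes need other tools.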


When it crashes, this is what I see:

=========================================================================
(gdb) run -d
Starting program: /usr/sbin/maui -d


Program received signal SIGSEGV, Segmentation fault.
0x00000036cd265ee7 in _IO_fflush (fp=0x7f0d010) at iofflush.c:43
43            result = _IO_SYNC (fp) ? EOF : 0;
(gdb)
(gdb) bt
#0  0x00000036cd265ee7 in _IO_fflush (fp=0x7f0d010) at iofflush.c:43
#1  0x000000000047c07b in MJobWriteStats (J=0x9b61080) at MJob.c:7815
#2  0x000000000048643e in MJobProcessCompleted (J=0x9b61080) at MJob.c:9562
#3  0x00000000004a6eb8 in MPBSWorkloadQuery (R=0x6a4b2e0,
     JCount=0x7ffffff7b938, SC=<value optimized out>) at MPBSI.c:871
#4  0x000000000045f926 in __MUTFunc (V=0x7ffffff7b830) at MUtil.c:4718
#5  0x0000000000462387 in MUThread (F=<value optimized out>,
     TimeOut=<value optimized out>, RC=<value optimized out>,
     ACount=<value optimized out>, Lock=<value optimized out>) at MUtil.c:4691
#6  0x0000000000498ed4 in MRMWorkloadQuery (WCount=0x7ffffff7b98c, SC=0x0)
     at MRM.c:595
#7  0x000000000049cb19 in MRMGetInfo () at MRM.c:364
#8  0x000000000042dc42 in MSchedProcessJobs (OldDay=0x7fffffffde40 "Tue",
     GlobalSQ=0x7ffffffdbe30, GlobalHQ=0x7ffffffbbe30) at MSched.c:6930
#9  0x0000000000405c46 in main (ArgC=2, ArgV=<value optimized out>)
     at Server.c:192
(gdb) frame
#0  0x00000036cd265ee7 in _IO_fflush (fp=0x7f0d010) at iofflush.c:43
43            result = _IO_SYNC (fp) ? EOF : 0;
(gdb) frame 1
#1  0x000000000047c07b in MJobWriteStats (J=0x9b61080) at MJob.c:7815
7815        fflush(MSched.statfp);
(gdb) list MJob.c:7815
7810
7811      if (MJobToTString(J,DEFAULT_WORKLOAD_TRACE_VERSION,Buf,sizeof(Buf)) == SUCCESS)
7812        {
7813        fprintf(MSched.statfp,"%s",Buf);
7814
7815        fflush(MSched.statfp);
7816
7817        DBG(4,fSTAT) DPrint("INFO:     job stats written for '%s'\n",
7818          J->Name);
7819        }
(gdb) p Buf
$3 = "16828", ' ' <repeats 18 times>, "0   1    coutu     coutu  345600 Completed  [max100:1] 1342534818 1342534819 1342534819 1342535999    [NONE] [NONE] [NONE] >=    0M >=      0M   [nonGPU] 1342534818   1    1 [NONE]:DEFA"...
(gdb)
=========================================================================

My guess is that some memory corruption has overwritten MSched.statfp, which
is just a file handle, so fflush crashes when it actually tries to
write to it.  Where that overwrite is occurring, though, is anyone's guess.
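
If the pointer itself is being overwritten, one way to confirm it is to keep
a scrambled shadow copy next to it and compare the two before every use.  The
sketch below is an illustration under that assumption, not Maui code:
guarded_fp, guarded_set, guarded_ok, guarded_flush and GUARD_KEY are
hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Arbitrary key so that the shadow copy is not a plain duplicate. */
#define GUARD_KEY ((uintptr_t)0x5a5a5a5au)

/* Hypothetical wrapper: a FILE pointer plus a scrambled copy of it. */
typedef struct {
    FILE     *fp;
    uintptr_t check;   /* (uintptr_t)fp ^ GUARD_KEY at the time of set */
} guarded_fp;

/* Record the pointer and its shadow copy together. */
void guarded_set(guarded_fp *g, FILE *fp)
{
    assert(g != NULL);
    g->fp = fp;
    g->check = (uintptr_t)fp ^ GUARD_KEY;
}

/* Return 1 if fp still matches its shadow copy, 0 if either was changed. */
int guarded_ok(const guarded_fp *g)
{
    return ((uintptr_t)g->fp ^ GUARD_KEY) == g->check;
}

/* Flush only when the pointer looks intact; return -1 instead of crashing. */
int guarded_flush(const guarded_fp *g)
{
    if (g->fp == NULL || !guarded_ok(g))
        return -1;
    return fflush(g->fp);
}
```

A failed guarded_ok check shows that the pointer changed since it was set,
but not who changed it.  A gdb hardware watchpoint on the variable (for
example "watch MSched.statfp") is the usual way to catch the writer itself.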

I am hoping someone on this list might have a clue.  It is really a mystery
to me why I only see this on one cluster.  They have exactly the same config
except for host name.  Here is my maui.cfg:

=========================================================================
ADMIN1                maui root
ADMIN3                ALL
ADMINHOST               launchpad.nmr.mgh.harvard.edu
BACKFILLPOLICY        FIRSTFIT
CLASSCFG[default] MAXPROCPERUSER=150
CLASSCFG[extended] MAXPROCPERUSER=50 MAXPROC=250
CLASSCFG[GPU] MAXPROCPERUSER=5000
CLASSCFG[matlab] MAXPROCPERUSER=60
CLASSCFG[max100] MAXPROCPERUSER=100
CLASSCFG[max10] MAXPROCPERUSER=10
CLASSCFG[max200] MAXPROCPERUSER=200
CLASSCFG[max20] MAXPROCPERUSER=20
CLASSCFG[max50] MAXPROCPERUSER=50
CLASSCFG[max75] MAXPROCPERUSER=75
CLASSCFG[p10] MAXPROCPERUSER=5000
CLASSCFG[p20] MAXPROCPERUSER=5000
CLASSCFG[p30] MAXPROCPERUSER=5000
CLASSCFG[p40] MAXPROCPERUSER=5000
CLASSCFG[p50] MAXPROCPERUSER=30
CLASSCFG[p5] MAXPROCPERUSER=5000
CLASSCFG[p60] MAXPROCPERUSER=20
CLASSWEIGHT           10
ENABLEMULTIREQJOBS TRUE
ENFORCERESOURCELIMITS   OFF
LOGFILEMAXSIZE        1000000000
LOGFILE               /var/spool/maui/log/maui.log
LOGLEVEL              2
NODEALLOCATIONPOLICY  PRIORITY
NODECFG[DEFAULT] PRIORITY=1000 PRIORITYF='PRIORITY + 3 * JOBCOUNT'
QUEUETIMEWEIGHT       1
RESERVATIONPOLICY     CURRENTHIGHEST
RMCFG[base]             TYPE=PBS
RMPOLLINTERVAL          00:00:30
SERVERHOST              launchpad.nmr.mgh.harvard.edu
SERVERMODE              NORMAL
SERVERPORT              40559
USERCFG[DEFAULT] MAXIPROC=8
USERCFG[jonghwan] MAXPROC=300
USERCFG[shafee] MAXPROC=300
=========================================================================

I actually changed the LOGLEVEL from 3 to 2 at one point, thinking the
error might be happening when writing to the log and that lowering the
amount it writes might affect things, but it didn't help.

---------------------------------------------------------------
Paul Raines                     http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street     Charlestown, MA 02129	    USA







