[Mauiusers] Maui crash `diagnose -p`

Dave Jackson jacksond at clusterresources.com
Thu Jan 19 11:26:01 MST 2006


Simon,

  The failure results from a buffer overflow triggered by the large
number of jobs you have in place (16,000 to 32,000).  This overflow
has been patched and a new snapshot of Maui has been released.
However, Maui has not been heavily tested at these queue sizes, and you
may experience additional related issues in the future.  Many sites with
very large workloads use Moab, which has been tested with over
75,000 jobs and is being enabled for sites with even larger workloads.
Moab may save you a good deal of time.
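
  As a rough illustration of this class of failure (this is not the
actual Maui code or the patch), a report built by repeatedly calling
sprintf() into a fixed-size buffer will run past the end once the
number of rows grows large enough.  The sketch below assumes a
hypothetical helper, buf_append(), that uses vsnprintf() so a row that
does not fit is rejected instead of overwriting adjacent memory.  The
names buf_append and rows_that_fit and the row format are invented for
the example.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not a Maui function): append formatted text to a
 * NUL-terminated buffer of total capacity `size`, never writing past it.
 * Returns 0 on success, -1 if the text does not fit; on failure the
 * buffer keeps its previous contents. */
static int buf_append(char *buf, size_t size, const char *fmt, ...)
{
    size_t used = strlen(buf);
    va_list ap;
    int n;

    if (used >= size)
        return -1;

    va_start(ap, fmt);
    n = vsnprintf(buf + used, size - used, fmt, ap);
    va_end(ap);

    if (n < 0 || (size_t)n >= size - used) {
        buf[used] = '\0';   /* discard the truncated row */
        return -1;
    }
    return 0;
}

/* Illustrative only: emit one fixed-width, 19-byte row per job and
 * return how many rows fit before the buffer is full. */
static int rows_that_fit(char *buf, size_t size, int njobs)
{
    int job;

    if (size == 0)
        return 0;
    buf[0] = '\0';
    for (job = 0; job < njobs; job++) {
        if (buf_append(buf, size, "job%-6d %8d\n", job, job * 10) != 0)
            break;
    }
    return job;
}
```

  With a 64-byte buffer and 32768 jobs, rows_that_fit() stops after
three rows (57 bytes) rather than overrunning the buffer; an unchecked
sprintf() loop would keep writing.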

  Best of luck and let us know if this addresses your problem.

Thanks,
Dave

On Thu, 2006-01-19 at 11:52 +0100, Simon Robbins wrote:
> Hello,
> 
> I'm running on a 64-bit machine (AMD Opteron) with CentOS 
> 3.5.3, Maui 3.2.6p14 and torque 2.0.0p5.  I also saw this 
> problem with Maui 3.2.6p13 compiled in both 32- and 64-bit 
> mode.
> 
> When a user runs `diagnose -p` the Maui scheduler 
> crashes.  I give the output from GDB here:
> 
> (gdb) bt
> #0  0x0000002a95884656 in _IO_default_xsputn_internal () from /lib64/tls/libc.so.6
> #1  0x0000002a9587981c in _IO_padn_internal () from /lib64/tls/libc.so.6
> #2  0x0000002a9585c22a in vfprintf () from /lib64/tls/libc.so.6
> #3  0x0000002a9587aa05 in vsprintf () from /lib64/tls/libc.so.6
> #4  0x0000002a95862efa in sprintf () from /lib64/tls/libc.so.6
> #5  0x00000000004c79e3 in MJobGetStartPriority (J=0x2a4680d0, PIndex=0, Priority=0x7fbff751b0, Mode=0,
>     Buffer=0x7fbff75c60 "diagnosing job priority information (partition: ALL)\n\nJob", ' ' <repeats 20 times>, "PRIORITY*   Cred(Class)    FS(Group)   Res(Proc)\n", ' ' <repeats 13 times>, "Weights   --------       1(    1)     1(  100)     1(   10)\n"...) at MPriority.c:1466
> #6  0x00000000004196e7 in UIDiagnosePriority (
>     Buffer=0x7fbff75c60 "diagnosing job priority information (partition: ALL)\n\nJob", ' ' <repeats 20 times>, "PRIORITY*   Cred(Class)    FS(Group)   Res(Proc)\n", ' ' <repeats 13 times>, "Weights   --------       1(    1)     1(  100)     1(   10)\n"..., BufSize=0x3fef1f0, P=0xd71040) at UserI.c:5260
> #7  0x000000000041af85 in UIDiagnose (RBuffer=0xb1e12ba "13 0 ALL [NONE]\n",
>     SBuffer=0x7fbff75c60 "diagnosing job priority information (partition: ALL)\n\nJob", ' ' <repeats 20 times>, "PRIORITY*   Cred(Class)    FS(Group)   Res(Proc)\n", ' ' <repeats 13 times>, "Weights   --------       1(    1)     1(  100)     1(   10)\n"..., FLAGS=5, Auth=0x7fbff753f0 "root",
>     SBufSize=0x3fef1f0) at UserI.c:6194
> #8  0x000000000040cbe5 in UIProcessCommand (S=0x2020202020202020) at OUserI.c:443
> #9  0x2e30202020293030 in ?? ()
> 
> ........... (many many similar lines) ...........
> 
> #38009 0x3732363734310a29 in ?? ()
> Cannot access memory at address 0x7fc0000000
> (gdb)
> 
> 
> IMPORTANT: I have changed the maximum number of jobs:
> include/msched.h: line 443
> #define MMAX_JOB          32768
> 
> I've also seen this with 16384 jobs.  Without this 
> change I don't see the problem.
> 
> Best regards,
> 
> Simon Robbins.
> _______________________________________________
> mauiusers mailing list
> mauiusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/mauiusers
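
A note on reading the backtrace above: the frame values from #8 onward
(S=0x2020202020202020, 0x2e30202020293030, 0x3732363734310a29) consist
entirely of ASCII character codes, which is consistent with the saved
stack having been overwritten by the text of the priority report.  The
sketch below is one way to check this; addr_to_ascii is a hypothetical
helper written for this example, and it assumes the little-endian byte
order of x86-64, so the lowest byte of the value comes first in memory.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: render the 8 bytes of a 64-bit value as text,
 * lowest byte first (little-endian memory order).  Non-printable bytes
 * are shown as '.'.  `out` must have room for 9 characters. */
static void addr_to_ascii(uint64_t addr, char out[9])
{
    int i;

    for (i = 0; i < 8; i++) {
        unsigned char c = (unsigned char)(addr >> (8 * i));
        out[i] = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
    }
    out[8] = '\0';
}

/* Print one bogus frame address next to its ASCII interpretation. */
static void print_decoded(uint64_t addr)
{
    char text[9];

    addr_to_ascii(addr, text);
    printf("0x%016llx -> \"%s\"\n", (unsigned long long)addr, text);
}
```

Calling print_decoded() on the three values gives eight spaces for
0x2020202020202020, "00)   0." for 0x2e30202020293030, and ").147627"
for 0x3732363734310a29 (the '.' there stands for a newline byte).  These
look like fragments of report rows rather than return addresses.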


