[Mauiusers] Maui 3.3.1 segfaults on MPBSNodeUpdate

Marco Perosa marco.perosa at gmail.com
Mon Aug 13 13:18:16 MDT 2012


Hi,

I have a problem with Maui version 3.3.1, used in conjunction with
Torque version 2.5.9.

The problem seems to occur only when a large number of jobs (including
completed ones) is recorded in the state of a single node. In that
case the output of 'qnodes node06' (the node in question here) is
very large ('qnodes node06 | wc -m' ---> 510372).
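
To see which attribute accounts for most of that size, a rough Python
sketch like the one below could be used. It assumes the usual qnodes
layout (the node name on an unindented line, followed by indented
"name = value" lines). The function name attribute_sizes and the
sample text are made up for illustration.

```python
def attribute_sizes(qnodes_text):
    """Map node name -> {attribute name: length of its value in characters}.

    Assumes each node starts with an unindented name line followed by
    indented "name = value" attribute lines.
    """
    sizes = {}
    node = None
    for line in qnodes_text.splitlines():
        if not line.strip():
            continue  # blank line between node records
        if not line[0].isspace():
            node = line.strip()  # unindented line: start of a new node
            sizes[node] = {}
        elif node is not None and " = " in line:
            name, value = line.strip().split(" = ", 1)
            sizes[node][name] = len(value)
    return sizes


# Tiny made-up sample; real input would be the saved output of 'qnodes node06'.
sample = (
    "node06\n"
    "     state = free\n"
    "     np = 8\n"
    "     jobs = 0/1.master, 1/2.master\n"
)

for attr, size in sorted(attribute_sizes(sample)["node06"].items(),
                         key=lambda item: item[1], reverse=True):
    print(attr, size)
```

Run against the real output, the largest entry should show which
attribute is growing with the number of jobs.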

On the cluster that I administer, one of the users regularly launches
a very large job array (2000 IDs). Because each ID executes very
quickly, all of the IDs can end up running on a single node while the
other nodes are occupied by different jobs that take longer to
complete. This is how the situation described above arises.

This is the debug output from the crash:

dmesg:
[262129.550823] maui[30562]: segfault at 7ffffffff000 ip 00007ffff7602845 sp 00007ffffffdbe68 error 6 in libc-2.11.3.so[7ffff74f8000+159000]

log:
07/27 17:07:30 INFO:     PBS node node06 set to state Idle (free)
07/27 17:07:30 MNodeFind(node06,N)
07/27 17:07:30 MRMNodePreUpdate(node06,Idle,BUNET)
07/27 17:07:30 MPBSNodeUpdate(node06,node06,Idle,BUNET)
07/27 17:07:30 __MPBSIGetSSSStatus(node06,rectime=1343401619,varattr=,jobs=,state=free,netload=142294541287,gres=,loadave=0.00,ncpus=8,physmem=4059908kb,availmem=4991168kb,totmem=5104124kb,idletime=262568,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux node06 2.6.32-5-amd64 #1 SMP Sun May 6 04:00:17 UTC 2012 x86_64,opsys=linux)

gdb:
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7602883 in ?? () from /lib/libc.so.6
(gdb) where
#0  0x00007ffff7602883 in ?? () from /lib/libc.so.6
#1  0x00000000004a46b9 in MPBSNodeUpdate (N=0x2345da0,
    PNode=<value optimized out>, NState=<value optimized out>,
    R=<value optimized out>) at MPBSI.c:3171
#2  0x2b72657473616d40 in ?? ()
#3  0x6a392b32312b3230 in ?? ()
...
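
Frames #2 and #3 do not look like code addresses. Read as
little-endian bytes (the machine is x86_64), they decode to printable
ASCII, which would be consistent with string data having overwritten
the saved return addresses on the stack. A minimal Python sketch of
that check follows; the helper name frame_as_text is made up for
illustration.

```python
def frame_as_text(address):
    """Interpret a 64-bit frame address as 8 little-endian ASCII bytes.

    Non-printable bytes are shown as '.'.
    """
    raw = address.to_bytes(8, byteorder="little")
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


# Frame addresses #2 and #3 copied from the gdb backtrace above.
frames = [0x2b72657473616d40, 0x6a392b32312b3230]

for address in frames:
    print(hex(address), "->", frame_as_text(address))
```

Both values decode entirely to printable characters ("@master+" and
"02+12+9j"), so these "addresses" look like fragments of text rather
than real return addresses.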

I suspect that one of the values involved exceeds some size limit,
but I am not sure what the right way to avoid this problem would be.

Thank you for any help you may provide.


Ciao,
Marco

