[torqueusers] segfaulting pbs_moms: torque-2.3.6-2cri.x86_64

Douglas McNab d.mcnab at physics.gla.ac.uk
Wed Nov 4 02:50:34 MST 2009


Hi,

I have an issue with segfaulting moms that seems to be correlated with the
server trying to ping its moms.
The server version is torque-2.3.6-2cri.x86_64.
We are currently supporting two OSes through the same batch system using
a submit filter and node properties. Therefore, we have two different
versions of the moms.
Nodes 1->295 run torque-2.3.6-2cri.x86_64 moms and nodes 296->309 run
torque-2.1.9-4cri.slc4.i386 moms.

When the moms segfault, the torque-2.1.9 moms stay up and all of the
torque-2.3.6 moms die.

I ran one of them under GDB and got the following call stack:

Program received signal SIGSEGV, Segmentation fault.
0x000000000041813f in ?? ()
(gdb) where
#0  0x000000000041813f in ?? ()
#1  0x000000000041985e in ?? ()
#2  0x0000000000419a70 in ?? ()
#3  0x0000000000416b97 in close_conn ()
#4  0x0000000000416c52 in close_conn ()
#5  0x00002b12d6cd7488 in wait_request () from /usr/lib64/libtorque.so.2
#6  0x0000000000416e1d in close_conn ()
#7  0x00000000004170e1 in close_conn ()
#8  0x00002b12d6f2b974 in __libc_start_main () from /lib64/libc.so.6
#9  0x0000000000405eb9 in close_conn ()
#10 0x00007fff7565e368 in ?? ()
#11 0x0000000000000000 in ?? ()
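
As a rough aid for reading the trace above, here is a small Python sketch
that parses the gdb frames and summarises which ones resolved to a symbol.
The helper names (parse_backtrace, summarise) are made up for illustration
and are not part of TORQUE or gdb. The summary suggests the binary is
probably stripped: several frames at quite different addresses are all
labelled close_conn, including the caller of __libc_start_main, which is
what gdb tends to show when it only has exported symbols and picks the
nearest preceding one. If that assumption holds, a trace from an unstripped
pbs_mom (or one with matching debug symbols) would likely be more telling.

```python
import re
from collections import Counter

# The backtrace exactly as printed by gdb in this message.
BACKTRACE = """\
#0  0x000000000041813f in ?? ()
#1  0x000000000041985e in ?? ()
#2  0x0000000000419a70 in ?? ()
#3  0x0000000000416b97 in close_conn ()
#4  0x0000000000416c52 in close_conn ()
#5  0x00002b12d6cd7488 in wait_request () from /usr/lib64/libtorque.so.2
#6  0x0000000000416e1d in close_conn ()
#7  0x00000000004170e1 in close_conn ()
#8  0x00002b12d6f2b974 in __libc_start_main () from /lib64/libc.so.6
#9  0x0000000000405eb9 in close_conn ()
#10 0x00007fff7565e368 in ?? ()
#11 0x0000000000000000 in ?? ()
"""

# Matches frame lines of the form "#N  0xADDR in SYMBOL () [from LIB]".
FRAME_RE = re.compile(
    r"^#(?P<num>\d+)\s+(?P<addr>0x[0-9a-fA-F]+) in (?P<sym>\S+) \(\)"
    r"(?: from (?P<lib>\S+))?"
)


def parse_backtrace(text):
    """Return one dict per frame; symbol is None where gdb printed '??'."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            sym = m.group("sym")
            frames.append({
                "num": int(m.group("num")),
                "addr": int(m.group("addr"), 16),
                "symbol": None if sym == "??" else sym,
                "lib": m.group("lib"),
            })
    return frames


def summarise(frames):
    """Return (frame numbers without a symbol, count of frames per symbol)."""
    unresolved = [f["num"] for f in frames if f["symbol"] is None]
    counts = Counter(f["symbol"] for f in frames if f["symbol"] is not None)
    return unresolved, counts


if __name__ == "__main__":
    unresolved, counts = summarise(parse_backtrace(BACKTRACE))
    print("frames without symbols:", unresolved)
    for symbol, n in counts.most_common():
        print(f"{symbol}: {n} frame(s)")
```

The same functions can be pointed at a trace saved from another node to
check whether the crash lands at the same addresses each time.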

Unfortunately this doesn't really give me any clues.
Does anyone have any other ideas?

Cheers,

Dug

-- 
ScotGrid, Room 481, Kelvin Building, University of Glasgow