[torqueusers] segfaulting pbs_moms: torque-2.3.6-2cri.x86_64
thpierce at gmail.com
Thu Nov 5 09:24:02 MST 2009
I had mixed 32-bit and 64-bit moms and it did not work well. I
recovered by switching to a full 32-bit setup for Torque, both the pbs
server and the moms. Later, when the whole architecture was 64-bit, I
moved everything up to 64-bit. Just my two cents.
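On the backtrace quoted below: most frames have no symbols, and the repeated close_conn labels may simply be gdb attributing addresses to the nearest exported symbol in a stripped binary rather than real recursion. That is my assumption, not something the trace confirms. A minimal Python sketch that tallies it, where the helper names parse_backtrace and summarize are made up for illustration:

```python
import re
from collections import Counter

# Matches gdb "where" lines such as:
#   #5 0x00002b12d6cd7488 in wait_request () from /usr/lib64/libtorque.so.2
# A leading "> " quote prefix is tolerated because we use search(), not match().
FRAME_RE = re.compile(
    r"#(?P<index>\d+)\s+(?P<addr>0x[0-9a-fA-F]+) in (?P<func>\S+) \(\)"
    r"(?: from (?P<lib>\S+))?"
)


def parse_backtrace(text):
    """Return one dict (index, addr, func, lib) per frame line found in text."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append(match.groupdict())
    return frames


def summarize(frames):
    """Return (number of '??' frames, Counter of the remaining symbol names)."""
    names = Counter(frame["func"] for frame in frames)
    unresolved = names.pop("??", 0)
    return unresolved, names


# The call stack exactly as posted in the original message.
BACKTRACE = """\
#0 0x000000000041813f in ?? ()
#1 0x000000000041985e in ?? ()
#2 0x0000000000419a70 in ?? ()
#3 0x0000000000416b97 in close_conn ()
#4 0x0000000000416c52 in close_conn ()
#5 0x00002b12d6cd7488 in wait_request () from /usr/lib64/libtorque.so.2
#6 0x0000000000416e1d in close_conn ()
#7 0x00000000004170e1 in close_conn ()
#8 0x00002b12d6f2b974 in __libc_start_main () from /lib64/libc.so.6
#9 0x0000000000405eb9 in close_conn ()
#10 0x00007fff7565e368 in ?? ()
#11 0x0000000000000000 in ?? ()
"""

if __name__ == "__main__":
    frames = parse_backtrace(BACKTRACE)
    unresolved, names = summarize(frames)
    print(f"{len(frames)} frames, {unresolved} unresolved")
    for name, count in names.most_common():
        print(f"{name}: {count}")
```

If the mom binary really is stripped, rebuilding pbs_mom with debugging symbols (for example, compiling with -g) and repeating the gdb run would probably give a more useful trace.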
On Wed, Nov 4, 2009 at 4:50 AM, Douglas McNab <d.mcnab at physics.gla.ac.uk> wrote:
> I have an issue with segfaulting moms that seems correlated with the
> server trying to ping its moms.
> The server version is torque-2.3.6-2cri.x86_64.
> We are currently supporting two OSes through the same batch system using
> a submit filter and node properties. Therefore, we have two different
> versions of moms.
> Nodes 1->295 have moms torque-2.3.6-2cri.x86_64 and 296->309 have moms
> When the moms segfault, we see that the torque-2.1.9 moms stay up and
> all of the torque-2.3.6 moms die.
> I ran one of them through GDB and can see the call stack:
> Program received signal SIGSEGV, Segmentation fault.
> 0x000000000041813f in ?? ()
> (gdb) where
> #0 0x000000000041813f in ?? ()
> #1 0x000000000041985e in ?? ()
> #2 0x0000000000419a70 in ?? ()
> #3 0x0000000000416b97 in close_conn ()
> #4 0x0000000000416c52 in close_conn ()
> #5 0x00002b12d6cd7488 in wait_request () from /usr/lib64/libtorque.so.2
> #6 0x0000000000416e1d in close_conn ()
> #7 0x00000000004170e1 in close_conn ()
> #8 0x00002b12d6f2b974 in __libc_start_main () from /lib64/libc.so.6
> #9 0x0000000000405eb9 in close_conn ()
> #10 0x00007fff7565e368 in ?? ()
> #11 0x0000000000000000 in ?? ()
> Unfortunately this doesn't really give me any clues.
> Does anyone have any other ideas?
> ScotGrid, Room 481, Kelvin Building, University of Glasgow