[torqueusers] segfaulting pbs_moms: torque-2.3.6-2cri.x86_64

Douglas McNab d.mcnab at physics.gla.ac.uk
Thu Nov 12 08:36:51 MST 2009


Hi Folks,

Thanks for all your replies.  I had thought that mixing versions was a
little unsafe.  However, I am a little confused as to why they can work
together for a period of time and then segfault when the server pings the
moms.  To find an explanation I built a debug version of the segfaulting
mom, torque-2.3.6-2cri.x86_64, and running it under gdb seems to move a
little closer to the problem.

Program received signal SIGSEGV, Segmentation fault.
mom_server_find_by_ip (search_ipaddr=177078032) at mom_server.c:450
450           ipaddr = ntohl(addr->sin_addr.s_addr);
(gdb) where
#0  mom_server_find_by_ip (search_ipaddr=177078032) at mom_server.c:450
#1  0x000000000041965e in mom_server_valid_message_source (stream=0) at mom_server.c:2022
#2  0x0000000000419870 in is_request (stream=0, version=1, cmdp=0x7fffcb2774d8) at mom_server.c:2125
#3  0x0000000000416997 in do_rpp (stream=0) at mom_main.c:5351
#4  0x0000000000416a52 in rpp_request (fd=<value optimized out>) at mom_main.c:5408
#5  0x00002ae6c4678bc8 in wait_request (waittime=<value optimized out>, SState=0x0) at ../Libnet/net_server.c:469
#6  0x0000000000416c1d in main_loop () at mom_main.c:8046
#7  0x0000000000416ee1 in main (argc=1, argv=0x7fffcb277bc8) at mom_main.c:8148
(gdb) print ipaddr
No symbol "ipaddr" in current context.
(gdb) print addr
$1 = <value optimized out>
(gdb) print addr->sin_addr.s_addr
Cannot access memory at address 0x4
(gdb) print * addr
Cannot access memory at address 0x0
(gdb) frame 1
#1  0x000000000041965e in mom_server_valid_message_source (stream=0) at mom_server.c:2022
2022      if ((pms = mom_server_find_by_ip(ipaddr)))
(gdb) print ipaddr
No symbol "ipaddr" in current context.
(gdb)
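
As an aside for anyone following the trace, search_ipaddr=177078032 in
frame 0 is easier to recognise as a dotted quad.  Below is a minimal
sketch, assuming the value is held in host byte order (the ntohl() on
line 450 suggests this, but the surrounding source is not shown here).
ipv4_to_dotted is a hypothetical helper, not part of TORQUE.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not from TORQUE): format an IPv4 address that is
 * stored as a host-byte-order integer into dotted-quad notation.
 * Assumption: the caller passes a host-order value such as search_ipaddr. */
static const char *ipv4_to_dotted(uint32_t host_order, char *buf, size_t buflen)
{
    snprintf(buf, buflen, "%u.%u.%u.%u",
             (unsigned)((host_order >> 24) & 0xffu),  /* most significant octet */
             (unsigned)((host_order >> 16) & 0xffu),
             (unsigned)((host_order >> 8) & 0xffu),
             (unsigned)(host_order & 0xffu));         /* least significant octet */
    return buf;
}
```

Called with 177078032 and a 16-byte buffer, this gives 10.141.255.16,
which can then be checked against the server's address.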

It appears that addr is NULL (the failed accesses are at addresses 0x0
and 0x4), which is slightly confusing.  Does anyone have detailed
knowledge of the source, enough to comment on this?
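
The two gdb errors fit a NULL addr: on Linux, sin_addr sits 4 bytes into
struct sockaddr_in, so reading addr->sin_addr.s_addr through a NULL
pointer faults at 0x4.  A defensive lookup might look like the sketch
below.  It is only an illustration: struct server_entry and find_by_ip
are hypothetical names, not the TORQUE code, and the sketch does not
explain why addr ends up NULL in the first place.

```c
#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/* Hypothetical stand-in for a per-server record; NOT the TORQUE structure. */
struct server_entry {
    struct sockaddr_in *addr;   /* may be NULL if no address was recorded */
};

/* Return the index of the first entry whose IPv4 address matches
 * search_ipaddr (host byte order), or -1 if none matches.
 * Entries with a NULL addr are skipped instead of dereferenced. */
static int find_by_ip(const struct server_entry *tab, size_t n,
                      uint32_t search_ipaddr)
{
    for (size_t i = 0; i < n; i++) {
        const struct sockaddr_in *addr = tab[i].addr;

        if (addr == NULL)
            continue;           /* guard: this is where the crash would occur */

        if (ntohl(addr->sin_addr.s_addr) == search_ipaddr)
            return (int)i;
    }
    return -1;
}
```

A guard like this would only hide the symptom; the open question is
still which code path leaves the address unset.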

Cheers,

Dug

2009/11/5 Ken Nielson <knielson at adaptivecomputing.com>

> Dug,
>
> I currently use both 32 bit and 64 bit machines in my cluster running 2.3.x
> and 2.4.x. I have not had any problems except when using high availability
> because serverdb is created directly from memory so the 32 bit and 64 bit
> machines do not create compatible images.
>
> Did something change between 2.1.x and 2.3.x in the protocol?
>
> I believe this may be a version incompatibility problem and not an
> architecture problem.
>
> Ken Nielson
> Adaptive Computing
>
>
>
> Garrick Staples wrote:
>
>> I used a mix of 32bit and 64bit pbs_moms for years.  It was never a
>> problem.
>>
>> This is just another bug in the 2.3.x line.  The 2.1.x line is stable.
>>
>> On Thu, Nov 05, 2009 at 11:24:02AM -0500, Tom Pierce alleged:
>>
>>
>>> Dear Douglas,
>>>
>>> I had mixed 32 bit moms and 64 bit moms and it did not work well.  I
>>> recovered by switching to a full 32 bit setup for Torque both pbs and
>>> moms.  Later when the full architecture was 64 bit I moved up to 64
>>> bit everywhere.
>>>
>>> my two cents.
>>>
>>> Tom
>>>
>>> On Wed, Nov 4, 2009 at 4:50 AM, Douglas McNab <d.mcnab at physics.gla.ac.uk>
>>> wrote:
>>>
>>>
>>>> Hi,
>>>>
>>>> I have an issue with segfaulting moms that seems correlated with the
>>>> server trying to ping its moms.
>>>> The server version is torque-2.3.6-2cri.x86_64
>>>> We are currently supporting two OSes through the same batch system using
>>>> a submit filter and node properties.  Therefore, we have two different
>>>> versions of moms.
>>>> Nodes 1->295 have moms torque-2.3.6-2cri.x86_64 and 296->309 have moms
>>>> torque-2.1.9-4cri.slc4.i386
>>>>
>>>> When the moms segfault we see that the torque-2.1.9 moms stay up and
>>>> only the torque-2.3.6 moms die.
>>>>
>>>> I ran one of them through GDB and can see the call stack:
>>>>
>>>> Program received signal SIGSEGV, Segmentation fault.
>>>> 0x000000000041813f in ?? ()
>>>> (gdb) where
>>>> #0  0x000000000041813f in ?? ()
>>>> #1  0x000000000041985e in ?? ()
>>>> #2  0x0000000000419a70 in ?? ()
>>>> #3  0x0000000000416b97 in close_conn ()
>>>> #4  0x0000000000416c52 in close_conn ()
>>>> #5  0x00002b12d6cd7488 in wait_request () from /usr/lib64/libtorque.so.2
>>>> #6  0x0000000000416e1d in close_conn ()
>>>> #7  0x00000000004170e1 in close_conn ()
>>>> #8  0x00002b12d6f2b974 in __libc_start_main () from /lib64/libc.so.6
>>>> #9  0x0000000000405eb9 in close_conn ()
>>>> #10 0x00007fff7565e368 in ?? ()
>>>> #11 0x0000000000000000 in ?? ()
>>>>
>>>> Unfortunately this doesn't really give me any clues.
>>>> Does anyone have any other ideas?
>>>>
>>>> Cheers,
>>>>
>>>> Dug
>>>>
>>>> --
>>>> ScotGrid, Room 481, Kelvin Building, University of Glasgow
>>>>
>>>>
>>>> _______________________________________________
>>>> torqueusers mailing list
>>>> torqueusers at supercluster.org
>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>
>>>>
>>>>
>>>>
>>>
>>> --
>>> -----------------------
>>> Thanks
>>>
>>> Tom
>>>
>>
>
>


-- 
ScotGrid, Room 481, Kelvin Building, University of Glasgow
tel: +44(0)141 330 6439