[torqueusers] segfaulting pbs_moms: torque-2.3.6-2cri.x86_64

Ken Nielson knielson at adaptivecomputing.com
Thu Nov 5 13:20:33 MST 2009


Dug,

I currently use both 32-bit and 64-bit machines in my cluster running 
2.3.x and 2.4.x. I have not had any problems except with high 
availability: serverdb is written directly from memory, so the 32-bit 
and 64-bit machines do not create compatible images.
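
To illustrate the kind of mismatch described above, here is a minimal
Python sketch. It is not Torque's actual serverdb code: the record layout
(a C long job id followed by an int state) and the names LAYOUT_32,
LAYOUT_64, dump and load are assumptions made up for the example. It only
shows why a structure dumped straight from memory by a 64-bit build may not
be readable by a 32-bit build.

```python
import struct

# Hypothetical record, standing in for a C struct dumped raw to disk:
#     struct rec { long job_id; int state; };
# Standard-size format codes are used so the sketch behaves the same on
# any host:
#   "<ii"   models a 32-bit (ILP32) build, where long is 4 bytes
#   "<qi4x" models a 64-bit (LP64) build, where long is 8 bytes and the
#           struct carries 4 bytes of tail padding
LAYOUT_32 = struct.Struct("<ii")
LAYOUT_64 = struct.Struct("<qi4x")


def dump(layout, job_id, state):
    """Serialize the record the way a build with this layout would."""
    return layout.pack(job_id, state)


def load(layout, image):
    """Read a record back, refusing an image of the wrong size."""
    if len(image) != layout.size:
        raise ValueError(
            "image is %d bytes, this build expects %d"
            % (len(image), layout.size)
        )
    return layout.unpack(image)


if __name__ == "__main__":
    image_64 = dump(LAYOUT_64, 1234, 2)
    print("32-bit record size:", LAYOUT_32.size)
    print("64-bit record size:", LAYOUT_64.size)
    print("64-bit build reads its own image:", load(LAYOUT_64, image_64))
    try:
        load(LAYOUT_32, image_64)
    except ValueError as err:
        print("32-bit build rejects it:", err)
```

A format that writes each field with an explicit width and byte order would
avoid this, at the cost of a conversion step on every read and write.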

Did something change between 2.1.x and 2.3.x in the protocol?

I believe this may be a version incompatibility problem and not an 
architecture problem.

Ken Nielson
Adaptive Computing



Garrick Staples wrote:
> I used a mix of 32bit and 64bit pbs_moms for years.  It was never a problem.
>
> This is just another bug in the 2.3.x line.  The 2.1.x line is stable.
>
> On Thu, Nov 05, 2009 at 11:24:02AM -0500, Tom Pierce alleged:
>   
>> Dear Douglas,
>>
>> I had mixed 32 bit moms and 64 bit moms and it did not work well.  I
>> recovered by switching to a full 32 bit setup for Torque both pbs and
>> moms.  Later when the full architecture was 64 bit I moved up to 64
>> bit everywhere.
>>
>> my two cents.
>>
>> Tom
>>
>> On Wed, Nov 4, 2009 at 4:50 AM, Douglas McNab <d.mcnab at physics.gla.ac.uk> wrote:
>>     
>>> Hi,
>>>
>>> I have an issue with segfaulting moms that seems correlated with the
>>> server trying to ping its moms.
>>> The server version is torque-2.3.6-2cri.x86_64
>>> We are currently supporting two OS's through the same batch system using
>>> submit filter and node properties.   Therefore, we have two different
>>> versions of moms.
>>> Nodes 1->295 have moms torque-2.3.6-2cri.x86_64 and 296->309 have moms
>>> torque-2.1.9-4cri.slc4.i386
>>>
>>> When the moms segfault we see that the torque-2.1.9 moms stay up and
>>> all of the torque-2.3.6 moms die.
>>>
>>> I ran one of them through GDB and can see the call stack:
>>>
>>> Program received signal SIGSEGV, Segmentation fault.
>>> 0x000000000041813f in ?? ()
>>> (gdb) where
>>> #0  0x000000000041813f in ?? ()
>>> #1  0x000000000041985e in ?? ()
>>> #2  0x0000000000419a70 in ?? ()
>>> #3  0x0000000000416b97 in close_conn ()
>>> #4  0x0000000000416c52 in close_conn ()
>>> #5  0x00002b12d6cd7488 in wait_request () from /usr/lib64/libtorque.so.2
>>> #6  0x0000000000416e1d in close_conn ()
>>> #7  0x00000000004170e1 in close_conn ()
>>> #8  0x00002b12d6f2b974 in __libc_start_main () from /lib64/libc.so.6
>>> #9  0x0000000000405eb9 in close_conn ()
>>> #10 0x00007fff7565e368 in ?? ()
>>> #11 0x0000000000000000 in ?? ()
>>>
>>> Unfortunately this doesn't really give me any clues.
>>> Does anyone have any other ideas?
>>>
>>> Cheers,
>>>
>>> Dug
>>>
>>> --
>>> ScotGrid, Room 481, Kelvin Building, University of Glasgow
>>>
>>>
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>
>>>
>>>       
>>
>> -- 
>> -----------------------
>> Thanks
>>
>> Tom


