[torqueusers] RE: torqueusers Digest, Vol 38, Issue 12

Aaron Knister aaron at iges.org
Mon Sep 10 11:18:03 MDT 2007


Are you absolutely sure no software changes have been made? Try
tailing /var/log/yum.log and up2date.log to see whether any software
updates have occurred. It's unusual for software to break spontaneously
in this manner (though not as unusual as we'd all like to think).
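
A minimal sketch of that check, assuming an RPM-based node like the FC4
systems described in the quoted message. The function name
show_recent_updates is made up for illustration, and /var/log/up2date is
only my guess at where the up2date log lives on your install; adjust the
paths as needed:

```shell
#!/bin/sh
# Print the last entries of each package-manager log that is readable.
# The log paths passed in below are assumptions, not verified locations.
show_recent_updates() {
    for log in "$@"; do
        if [ -r "$log" ]; then
            echo "== $log =="
            tail -n 20 "$log"
        else
            echo "== $log (not found or unreadable) =="
        fi
    done
}

show_recent_updates /var/log/yum.log /var/log/up2date

# Cross-check against the RPM database: most recently installed first.
if command -v rpm >/dev/null 2>&1; then
    rpm -qa --last | head -n 20
fi
```

If anything was installed or updated shortly before the nodes went
down, that package would be the first thing I'd look at.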

-Aaron

On Sep 10, 2007, at 1:05 PM, Moody, Tristan wrote:

> A gdb backtrace gives the following:
>
> (gdb) run
> Starting program: /usr/local/bin/qstat
>
> Program received signal SIGFPE, Arithmetic exception.
> 0x0000003c08f088a8 in ?? ()
> (gdb) bt
> #0  0x0000003c08f088a8 in ?? ()
> #1  0x00007fffffbf3a20 in ?? ()
> #2  0x00007fffffbf38e0 in ?? ()
> #3  0x00007fffffbf38d0 in ?? ()
> #4  0x0000003c0910b858 in ?? ()
> #5  0x00000000000668c3 in ?? ()
> #6  0x0000003c09111dc0 in ?? ()
> #7  0x0000000000000000 in ?? ()
> (gdb)
>
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org on behalf of
> torqueusers-request at supercluster.org
> Sent: Sun 9/9/2007 1:00 PM
> To: torqueusers at supercluster.org
> Subject: torqueusers Digest, Vol 38, Issue 12
>
> Today's Topics:
>
>    1. Floating Point exception for pbs_mom (Moody, Tristan)
>    2. Re: Floating Point exception for pbs_mom (Garrick Staples)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Sat, 8 Sep 2007 22:15:09 -0500
> From: "Moody, Tristan" <tmoody at ku.edu>
> Subject: [torqueusers] Floating Point exception for pbs_mom
> To: <torqueusers at supercluster.org>
> Message-ID:
> 	<515CE52A7472B549AFE9F121A619F98894089D at MAILBOXTHREE.home.ku.edu>
> Content-Type: text/plain;	charset="iso-8859-1"
>
>
>
> After weeks of running with no problems, I discovered yesterday that
> about half of the nodes on the cluster I operate were listed as "down."
> After ssh-ing directly into the nodes to investigate, I found that any
> torque-related software I tried to run on those nodes returned a
> floating-point exception. Today I discovered that the problem is
> happening on ALL of the nodes: I cannot get pbs_mom started on any of
> the compute nodes. The compute nodes are running kernel
> 2.6.11-1.1369_FC4smp x86_64 and torque version 2.1.8.
>
> Here's an excerpt from dmesg on one of the nodes:
>
> pbs_mom[1903] trap divide error rip:3c08f088a8 rsp:7fffffe60f10 error:0
> pbs_mom[2408] trap divide error rip:3c08f088a8 rsp:7fffffe06d30 error:0
> pbs_mom[2620] trap divide error rip:3c08f088a8 rsp:7fffffbef3a0 error:0
> momctl[2774] trap divide error rip:3c08f088a8 rsp:7fffff9a9a60 error:0
> qstat[2775] trap divide error rip:3c08f088a8 rsp:7fffffe14df0 error:0
>
>
> Any ideas on what exactly is going wrong? This had been running fine
> until yesterday, and there have been no changes to the system in the
> past couple of weeks.
>
>
> Tristan Moody
>
> ------------------------------
>
> Message: 2
> Date: Sun, 9 Sep 2007 09:06:59 -0700
> From: Garrick Staples <garrick at usc.edu>
> Subject: Re: [torqueusers] Floating Point exception for pbs_mom
> To: torqueusers at supercluster.org
> Message-ID: <20070909160659.GC19043 at polop.usc.edu>
> Content-Type: text/plain; charset="us-ascii"
>
> Can you get a gdb backtrace?
>
> $ gdb qstat
> ...
> (gdb) run
> (gdb) bt
>

Aaron Knister
Associate Systems Administrator/Web Designer
Center for Research on Environment and Water

(301) 595-7001
aaron at iges.org


