[torqueusers] RE: torqueusers Digest, Vol 38, Issue 12
Aaron Knister
aaron at iges.org
Mon Sep 10 11:18:03 MDT 2007
Are you absolutely 100% sure no software changes have been made? Try
tailing /var/log/yum.log and up2date.log to see whether any software
updates have occurred. It's unusual for software to spontaneously
break in this manner (but not as unusual as we'd all like to think).
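
A rough sketch of that log check, assuming a Red Hat-style node where yum and up2date write their logs under /var/log. The function name recent_updates and the exact up2date path are illustrative assumptions, not part of any Torque tooling:

```shell
#!/bin/sh
# Print the last N lines of each package-manager log that is readable.
# Usage: recent_updates COUNT LOGFILE...
recent_updates() {
    count="$1"
    shift
    for log in "$@"; do
        if [ -r "$log" ]; then
            echo "== $log =="
            tail -n "$count" "$log"
        else
            # Missing logs are reported rather than treated as errors.
            echo "== $log: not readable, skipping =="
        fi
    done
}

# Log paths are assumptions; adjust them for the distribution in use.
recent_updates 20 /var/log/yum.log /var/log/up2date.log
```

If the logs show nothing, `rpm -qa --last | head` lists the most recently installed packages by install time, which can catch updates applied outside yum.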
-Aaron
On Sep 10, 2007, at 1:05 PM, Moody, Tristan wrote:
> A gdb backtrace gives the following:
>
> (gdb) run
> Starting program: /usr/local/bin/qstat
>
> Program received signal SIGFPE, Arithmetic exception.
> 0x0000003c08f088a8 in ?? ()
> (gdb) bt
> #0 0x0000003c08f088a8 in ?? ()
> #1 0x00007fffffbf3a20 in ?? ()
> #2 0x00007fffffbf38e0 in ?? ()
> #3 0x00007fffffbf38d0 in ?? ()
> #4 0x0000003c0910b858 in ?? ()
> #5 0x00000000000668c3 in ?? ()
> #6 0x0000003c09111dc0 in ?? ()
> #7 0x0000000000000000 in ?? ()
> (gdb)
>
>
>
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org on behalf of torqueusers-request at supercluster.org
> Sent: Sun 9/9/2007 1:00 PM
> To: torqueusers at supercluster.org
> Subject: torqueusers Digest, Vol 38, Issue 12
>
> Today's Topics:
>
> 1. Floating Point exception for pbs_mom (Moody, Tristan)
> 2. Re: Floating Point exception for pbs_mom (Garrick Staples)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Sat, 8 Sep 2007 22:15:09 -0500
> From: "Moody, Tristan" <tmoody at ku.edu>
> Subject: [torqueusers] Floating Point exception for pbs_mom
> To: <torqueusers at supercluster.org>
> Message-ID:
> <515CE52A7472B549AFE9F121A619F98894089D at MAILBOXTHREE.home.ku.edu>
> Content-Type: text/plain; charset="iso-8859-1"
>
>
>
> After weeks of running with no problems, I discovered yesterday
> that about half of the nodes on the cluster I operate were listed
> as "down." After ssh-ing directly into the nodes to investigate, I
> discovered that any torque-related software I tried to run on those
> nodes returned a floating-point exception. Today I discovered that
> this problem is happening on ALL of the nodes--I cannot get pbs_mom
> started on any of the compute nodes. The compute nodes are running
> kernel 2.6.11-1.1369_FC4smp x86_64 and torque version 2.1.8.
>
> Here's an excerpt from dmesg on one of the nodes:
>
> pbs_mom[1903] trap divide error rip:3c08f088a8 rsp:7fffffe60f10 error:0
> pbs_mom[2408] trap divide error rip:3c08f088a8 rsp:7fffffe06d30 error:0
> pbs_mom[2620] trap divide error rip:3c08f088a8 rsp:7fffffbef3a0 error:0
> momctl[2774] trap divide error rip:3c08f088a8 rsp:7fffff9a9a60 error:0
> qstat[2775] trap divide error rip:3c08f088a8 rsp:7fffffe14df0 error:0
>
>
> Any ideas on what exactly is going wrong? This had been running
> fine until yesterday, and there have been no changes to the system
> in the past couple of weeks.
>
>
> Tristan Moody
>
> ------------------------------
>
> Message: 2
> Date: Sun, 9 Sep 2007 09:06:59 -0700
> From: Garrick Staples <garrick at usc.edu>
> Subject: Re: [torqueusers] Floating Point exception for pbs_mom
> To: torqueusers at supercluster.org
> Message-ID: <20070909160659.GC19043 at polop.usc.edu>
> Content-Type: text/plain; charset="us-ascii"
>
> Can you get a gdb backtrace?
>
> $ gdb qstat
> ...
> (gdb) run
> (gdb) bt
>
>
> ------------------------------
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
> End of torqueusers Digest, Vol 38, Issue 12
> *******************************************
Aaron Knister
Associate Systems Administrator/Web Designer
Center for Research on Environment and Water
(301) 595-7001
aaron at iges.org