[torqueusers] RE: pbs_mom gives floating-point exception on startup

Aaron Knister aaron at iges.org
Mon Sep 10 11:39:08 MDT 2007


I wouldn't worry about the lost ticks error. I get it on my systems  
too and they run fine.

As far as the log files go, I gave you the wrong file name for up2date.
It's /var/log/up2date, and the logs get rotated periodically, so there may
also be up2date.[1-4]. Check those and see what you can find.
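
Something along these lines would do that check in one go. It is only a rough sketch: the function name scan_update_logs is made up for illustration, and it assumes the rotated copies are plain text files sitting next to the current log.

```shell
#!/bin/sh
# Sketch: walk the update logs (current and rotated) in a directory and
# print the tail of each one, flagging files that exist but are empty.
# scan_update_logs is a hypothetical helper, not part of up2date or yum.
scan_update_logs() {
    dir="${1:-/var/log}"
    for f in "$dir"/up2date* "$dir"/yum.log*; do
        # An unmatched glob stays literal, so skip anything that is not a file.
        [ -f "$f" ] || continue
        if [ -s "$f" ]; then
            echo "== $f"
            tail -n 20 "$f"
        else
            echo "== $f: empty"
        fi
    done
}
```

Calling scan_update_logs /var/log on one of the affected nodes prints the last 20 lines of each log it finds. A package update dated around the time the nodes started failing would be the thing to look for.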

-Aaron

On Sep 10, 2007, at 1:28 PM, Moody, Tristan wrote:

> yum.log is empty, and there is no up2date.log.
>
> dmesg now shows this line as well (dunno if there's any relation):
> Losing some ticks... checking if CPU frequency changed.
>
> ------------------------------------------------
>
> Message: 5
> Date: Mon, 10 Sep 2007 12:05:19 -0500
> From: "Moody, Tristan" <tmoody at ku.edu>
> Subject: [torqueusers] RE: torqueusers Digest, Vol 38, Issue 12
> To: <torqueusers at supercluster.org>
> Message-ID:
> 	<515CE52A7472B549AFE9F121A619F98894089E at MAILBOXTHREE.home.ku.edu>
> Content-Type: text/plain; charset="iso-8859-1"
>
> A gdb backtrace gives the following:
>
> (gdb) run
> Starting program: /usr/local/bin/qstat
>
> Program received signal SIGFPE, Arithmetic exception.
> 0x0000003c08f088a8 in ?? ()
> (gdb) bt
> #0  0x0000003c08f088a8 in ?? ()
> #1  0x00007fffffbf3a20 in ?? ()
> #2  0x00007fffffbf38e0 in ?? ()
> #3  0x00007fffffbf38d0 in ?? ()
> #4  0x0000003c0910b858 in ?? ()
> #5  0x00000000000668c3 in ?? ()
> #6  0x0000003c09111dc0 in ?? ()
> #7  0x0000000000000000 in ?? ()
> (gdb)
>
>
>
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org on behalf of torqueusers-request at supercluster.org
> Sent: Sun 9/9/2007 1:00 PM
> To: torqueusers at supercluster.org
> Subject: torqueusers Digest, Vol 38, Issue 12
>
> Send torqueusers mailing list submissions to
> 	torqueusers at supercluster.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> 	http://www.supercluster.org/mailman/listinfo/torqueusers
> or, via email, send a message with subject or body 'help' to
> 	torqueusers-request at supercluster.org
>
> You can reach the person managing the list at
> 	torqueusers-owner at supercluster.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of torqueusers digest..."
>
>
> Today's Topics:
>
>    1. Floating Point exception for pbs_mom (Moody, Tristan)
>    2. Re: Floating Point exception for pbs_mom (Garrick Staples)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Sat, 8 Sep 2007 22:15:09 -0500
> From: "Moody, Tristan" <tmoody at ku.edu>
> Subject: [torqueusers] Floating Point exception for pbs_mom
> To: <torqueusers at supercluster.org>
> Message-ID:
> 	<515CE52A7472B549AFE9F121A619F98894089D at MAILBOXTHREE.home.ku.edu>
> Content-Type: text/plain;	charset="iso-8859-1"
>
>
>
> After weeks of running with no problems, I discovered yesterday
> that about half of the nodes on the cluster I operate were listed
> as "down."  After ssh-ing directly into the nodes to investigate, I
> discovered that any torque-related software I tried to run on those
> nodes returned a floating-point exception.  Today I discovered that
> this problem is happening on ALL of the nodes--I cannot get pbs_mom
> started on any of the compute nodes.  The compute nodes are running
> kernel 2.6.11-1.1369_FC4smp x86_64 and torque version 2.1.8.
>
> Here's an excerpt from dmesg on one of the nodes:
>
> pbs_mom[1903] trap divide error rip:3c08f088a8 rsp:7fffffe60f10 error:0
> pbs_mom[2408] trap divide error rip:3c08f088a8 rsp:7fffffe06d30 error:0
> pbs_mom[2620] trap divide error rip:3c08f088a8 rsp:7fffffbef3a0 error:0
> momctl[2774] trap divide error rip:3c08f088a8 rsp:7fffff9a9a60 error:0
> qstat[2775] trap divide error rip:3c08f088a8 rsp:7fffffe14df0 error:0
>
>
> Any ideas on what exactly is going wrong?  This had been running  
> fine until yesterday, and there have been no changes to the system  
> in the past couple weeks.
>
>
> Tristan Moody
>
>
>
>
>
>
> ------------------------------
>
> Message: 2
> Date: Sun, 9 Sep 2007 09:06:59 -0700
> From: Garrick Staples <garrick at usc.edu>
> Subject: Re: [torqueusers] Floating Point exception for pbs_mom
> To: torqueusers at supercluster.org
> Message-ID: <20070909160659.GC19043 at polop.usc.edu>
> Content-Type: text/plain; charset="us-ascii"
>
> On Sat, Sep 08, 2007 at 10:15:09PM -0500, Moody, Tristan alleged:
>>
>>
>> After weeks of running with no problems, I discovered yesterday
>> that about half of the nodes on the cluster I operate were listed
>> as "down."  After ssh-ing directly into the nodes to investigate,
>> I discovered that any torque-related software I tried to run on
>> those nodes returned a floating-point exception.  Today I discovered
>> that this problem is happening on ALL of the nodes--I cannot get
>> pbs_mom started on any of the compute nodes.  The compute nodes
>> are running kernel 2.6.11-1.1369_FC4smp x86_64 and torque version
>> 2.1.8.
>>
>> Here's an excerpt from dmesg on one of the nodes:
>>
>> pbs_mom[1903] trap divide error rip:3c08f088a8 rsp:7fffffe60f10 error:0
>> pbs_mom[2408] trap divide error rip:3c08f088a8 rsp:7fffffe06d30 error:0
>> pbs_mom[2620] trap divide error rip:3c08f088a8 rsp:7fffffbef3a0 error:0
>> momctl[2774] trap divide error rip:3c08f088a8 rsp:7fffff9a9a60 error:0
>> qstat[2775] trap divide error rip:3c08f088a8 rsp:7fffffe14df0 error:0
>>
>>
>> Any ideas on what exactly is going wrong?  This had been running  
>> fine until yesterday, and there have been no changes to the system  
>> in the past couple weeks.
>>
>
> Can you get a gdb backtrace?
>
> $ gdb qstat
> ...
> (gdb) run
> (gdb) bt
>
> ------------------------------
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
> End of torqueusers Digest, Vol 38, Issue 12
> *******************************************
>
> ------------------------------
>
> Message: 6
> Date: Mon, 10 Sep 2007 13:18:03 -0400
> From: Aaron Knister <aaron at iges.org>
> Subject: Re: [torqueusers] RE: torqueusers Digest, Vol 38, Issue 12
> To: "Moody, Tristan" <tmoody at ku.edu>
> Cc: torqueusers at supercluster.org
> Message-ID: <DBAAF280-1A36-4C1A-865E-709C99DC69CB at iges.org>
> Content-Type: text/plain; charset="us-ascii"
>
> You're absolutely 100% sure no software changes have been made? Try
> tailing /var/log/yum.log and up2date.log and see if any software
> updates have occurred. It's unusual for software to spontaneously
> break in this manner (but not as unusual as we'd all like to think).
>
> -Aaron
>
>
> Aaron Knister
> Associate Systems Administrator/Web Designer
> Center for Research on Environment and Water
>
> (301) 595-7001
> aaron at iges.org
>
>
>
>
> End of torqueusers Digest, Vol 38, Issue 13
> *******************************************
>

Aaron Knister
Associate Systems Administrator/Web Designer
Center for Research on Environment and Water

(301) 595-7001
aaron at iges.org


