[torqueusers] RE: torqueusers Digest, Vol 38, Issue 12

Moody, Tristan tmoody at ku.edu
Mon Sep 10 11:05:19 MDT 2007


A gdb backtrace gives the following:

(gdb) run
Starting program: /usr/local/bin/qstat

Program received signal SIGFPE, Arithmetic exception.
0x0000003c08f088a8 in ?? ()
(gdb) bt
#0  0x0000003c08f088a8 in ?? ()
#1  0x00007fffffbf3a20 in ?? ()
#2  0x00007fffffbf38e0 in ?? ()
#3  0x00007fffffbf38d0 in ?? ()
#4  0x0000003c0910b858 in ?? ()
#5  0x00000000000668c3 in ?? ()
#6  0x0000003c09111dc0 in ?? ()
#7  0x0000000000000000 in ?? ()
(gdb)
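
Every frame above resolves to `??`, so the backtrace alone does not say which file contains the faulting address 0x0000003c08f088a8. One way to narrow it down is to look up which memory mapping covers that address. The sketch below assumes the process's map is available as text in the usual Linux `/proc/<pid>/maps` layout (`start-end perms offset dev inode path`); the helper name `find_mapping` is made up for illustration and is not part of gdb or TORQUE.

```python
def find_mapping(maps_text, address):
    """Return (start, end, path) for the mapping that contains `address`.

    `maps_text` is assumed to be the contents of /proc/<pid>/maps.
    Returns None when no mapping covers the address.
    """
    for line in maps_text.splitlines():
        # Fields: range, perms, offset, dev, inode, optional path.
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # not a mapping line
        start_s, _, end_s = fields[0].partition("-")
        try:
            start, end = int(start_s, 16), int(end_s, 16)
        except ValueError:
            continue  # malformed range, skip it
        # The end address of a mapping is exclusive.
        if start <= address < end:
            path = fields[5].strip() if len(fields) > 5 else ""
            return start, end, path
    return None
```

With the map of the stopped process in `maps_text`, `find_mapping(maps_text, 0x3c08f088a8)` returns the address range and path of the object covering the faulting instruction; an empty path means an anonymous mapping.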



-----Original Message-----
From: torqueusers-bounces at supercluster.org on behalf of torqueusers-request at supercluster.org
Sent: Sun 9/9/2007 1:00 PM
To: torqueusers at supercluster.org
Subject: torqueusers Digest, Vol 38, Issue 12
 
Today's Topics:

   1. Floating Point exception for pbs_mom (Moody, Tristan)
   2. Re: Floating Point exception for pbs_mom (Garrick Staples)


----------------------------------------------------------------------

Message: 1
Date: Sat, 8 Sep 2007 22:15:09 -0500
From: "Moody, Tristan" <tmoody at ku.edu>
Subject: [torqueusers] Floating Point exception for pbs_mom
To: <torqueusers at supercluster.org>
Message-ID:
	<515CE52A7472B549AFE9F121A619F98894089D at MAILBOXTHREE.home.ku.edu>
Content-Type: text/plain;	charset="iso-8859-1"


 
After weeks of running with no problems, I discovered yesterday that about half of the nodes on the cluster I operate were listed as "down."  After ssh-ing directly into the nodes to investigate, I found that any torque-related software I tried to run on those nodes died with a floating-point exception.  Today I discovered that the problem is happening on ALL of the nodes: I cannot get pbs_mom started on any of the compute nodes.  The compute nodes are running kernel 2.6.11-1.1369_FC4smp (x86_64) and torque version 2.1.8.

Here's an excerpt from dmesg on one of the nodes:

pbs_mom[1903] trap divide error rip:3c08f088a8 rsp:7fffffe60f10 error:0
pbs_mom[2408] trap divide error rip:3c08f088a8 rsp:7fffffe06d30 error:0
pbs_mom[2620] trap divide error rip:3c08f088a8 rsp:7fffffbef3a0 error:0
momctl[2774] trap divide error rip:3c08f088a8 rsp:7fffff9a9a60 error:0
qstat[2775] trap divide error rip:3c08f088a8 rsp:7fffffe14df0 error:0
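
The excerpt is easier to read when grouped by instruction pointer. Note that "trap divide error" is the kernel's report of an integer divide fault, which Linux delivers to the process as SIGFPE, hence the "floating-point exception" message. The sketch below assumes the log lines have exactly the `name[pid] trap divide error rip:... rsp:... error:N` shape shown above; the names `TRAP_RE` and `group_by_rip` are illustrative only.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   pbs_mom[1903] trap divide error rip:3c08f088a8 rsp:7fffffe60f10 error:0
TRAP_RE = re.compile(
    r"^(?P<prog>[^\[\s]+)\[(?P<pid>\d+)\] trap divide error "
    r"rip:(?P<rip>[0-9a-f]+) rsp:(?P<rsp>[0-9a-f]+) error:(?P<err>\d+)$"
)

def group_by_rip(log_text):
    """Map each faulting instruction pointer to the set of program names."""
    groups = defaultdict(set)
    for line in log_text.splitlines():
        match = TRAP_RE.match(line.strip())
        if match:
            groups[int(match.group("rip"), 16)].add(match.group("prog"))
    return dict(groups)

# The five lines from the dmesg excerpt in this message.
EXCERPT = """\
pbs_mom[1903] trap divide error rip:3c08f088a8 rsp:7fffffe60f10 error:0
pbs_mom[2408] trap divide error rip:3c08f088a8 rsp:7fffffe06d30 error:0
pbs_mom[2620] trap divide error rip:3c08f088a8 rsp:7fffffbef3a0 error:0
momctl[2774] trap divide error rip:3c08f088a8 rsp:7fffff9a9a60 error:0
qstat[2775] trap divide error rip:3c08f088a8 rsp:7fffffe14df0 error:0
"""

if __name__ == "__main__":
    for rip, programs in group_by_rip(EXCERPT).items():
        print(hex(rip), sorted(programs))
```

For these five lines the result is a single address, 0x3c08f088a8, hit by pbs_mom, momctl, and qstat. Three different programs faulting at the same address suggests the division happens in code they share, such as a common shared library, rather than in any one binary, although the excerpt alone does not prove that.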


Any ideas on what exactly is going wrong?  This had been running fine until yesterday, and there have been no changes to the system in the past couple of weeks.


Tristan Moody






------------------------------

Message: 2
Date: Sun, 9 Sep 2007 09:06:59 -0700
From: Garrick Staples <garrick at usc.edu>
Subject: Re: [torqueusers] Floating Point exception for pbs_mom
To: torqueusers at supercluster.org
Message-ID: <20070909160659.GC19043 at polop.usc.edu>
Content-Type: text/plain; charset="us-ascii"

On Sat, Sep 08, 2007 at 10:15:09PM -0500, Moody, Tristan alleged:
> 
>  
> After weeks of running with no problems, I discovered yesterday that about half of the nodes on the cluster I operate were listed as "down."  After ssh-ing directly into the nodes to investigate, I found that any torque-related software I tried to run on those nodes died with a floating-point exception.  Today I discovered that the problem is happening on ALL of the nodes: I cannot get pbs_mom started on any of the compute nodes.  The compute nodes are running kernel 2.6.11-1.1369_FC4smp (x86_64) and torque version 2.1.8.
> 
> Here's an excerpt from dmesg on one of the nodes:
> 
> pbs_mom[1903] trap divide error rip:3c08f088a8 rsp:7fffffe60f10 error:0
> pbs_mom[2408] trap divide error rip:3c08f088a8 rsp:7fffffe06d30 error:0
> pbs_mom[2620] trap divide error rip:3c08f088a8 rsp:7fffffbef3a0 error:0
> momctl[2774] trap divide error rip:3c08f088a8 rsp:7fffff9a9a60 error:0
> qstat[2775] trap divide error rip:3c08f088a8 rsp:7fffffe14df0 error:0
> 
> 
> Any ideas on what exactly is going wrong?  This had been running fine until yesterday, and there have been no changes to the system in the past couple of weeks.
> 

Can you get a gdb backtrace?

$ gdb qstat
...
(gdb) run
(gdb) bt
