[torqueusers] Floating Point exception for pbs_mom

Garrick Staples garrick at usc.edu
Sun Sep 9 10:06:59 MDT 2007


On Sat, Sep 08, 2007 at 10:15:09PM -0500, Moody, Tristan alleged:
> 
>  
> After weeks of running with no problems, I found yesterday that about half of the nodes on the cluster I operate were listed as "down."  After ssh-ing directly into the nodes to investigate, I found that any torque-related program I tried to run on those nodes died with a floating-point exception.  Today the problem is happening on ALL of the nodes: I cannot get pbs_mom started on any of the compute nodes.  The compute nodes are running kernel 2.6.11-1.1369_FC4smp x86_64 and torque version 2.1.8.
> 
> Here's an excerpt from dmesg on one of the nodes:
> 
> pbs_mom[1903] trap divide error rip:3c08f088a8 rsp:7fffffe60f10 error:0
> pbs_mom[2408] trap divide error rip:3c08f088a8 rsp:7fffffe06d30 error:0
> pbs_mom[2620] trap divide error rip:3c08f088a8 rsp:7fffffbef3a0 error:0
> momctl[2774] trap divide error rip:3c08f088a8 rsp:7fffff9a9a60 error:0
> qstat[2775] trap divide error rip:3c08f088a8 rsp:7fffffe14df0 error:0
> 
> 
> Any ideas on what exactly is going wrong?  This had been running fine until yesterday, and there have been no changes to the system in the past couple weeks.
> 

Can you get a gdb backtrace?

$ gdb qstat
...
(gdb) run
(gdb) bt
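
As a side check on the dmesg excerpt quoted above, it may help to confirm that every fault lands on the same instruction pointer. The following is only a rough sketch: the variable names (dmesg_excerpt, rip_counts) are made up for illustration, and the sample lines are copied from the excerpt rather than read from a live system.

```shell
# Sample lines copied from the dmesg excerpt quoted above.
# (On a live node you would pipe `dmesg` into the same sed command instead.)
dmesg_excerpt='pbs_mom[1903] trap divide error rip:3c08f088a8 rsp:7fffffe60f10 error:0
pbs_mom[2408] trap divide error rip:3c08f088a8 rsp:7fffffe06d30 error:0
pbs_mom[2620] trap divide error rip:3c08f088a8 rsp:7fffffbef3a0 error:0
momctl[2774] trap divide error rip:3c08f088a8 rsp:7fffff9a9a60 error:0
qstat[2775] trap divide error rip:3c08f088a8 rsp:7fffffe14df0 error:0'

# Pull out the rip value from each "trap divide error" line,
# then count how many faults were reported at each address.
rip_counts=$(printf '%s\n' "$dmesg_excerpt" \
  | sed -n 's/.*trap divide error rip:\([0-9a-f]*\) .*/\1/p' \
  | sort | uniq -c | awk '{print $2, $1}')

# One output line per distinct address: "<rip> <number of faults>".
echo "$rip_counts"
```

For the excerpt above this prints a single line, "3c08f088a8 5": pbs_mom, momctl and qstat all fault at the same address. That hints that the divide error is in code the three programs have in common (a shared library, for instance), but that is a guess until the backtrace confirms it.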
