[torqueusers] Floating Point exception for pbs_mom

Garrick Staples garrick at usc.edu
Sun Sep 9 10:06:59 MDT 2007

On Sat, Sep 08, 2007 at 10:15:09PM -0500, Moody, Tristan alleged:
> After weeks of running with no problems, I discovered yesterday that about half of the nodes on the cluster I operate were listed as "down." After ssh-ing directly into the nodes to investigate, I found that any torque-related software I tried to run on them returned a floating-point exception. Today I discovered that this is happening on ALL of the nodes: I cannot get pbs_mom started on any of the compute nodes. The compute nodes are running kernel 2.6.11-1.1369_FC4smp x86_64 and torque version 2.1.8.
> Here's an excerpt from dmesg on one of the nodes:
> pbs_mom[1903] trap divide error rip:3c08f088a8 rsp:7fffffe60f10 error:0
> pbs_mom[2408] trap divide error rip:3c08f088a8 rsp:7fffffe06d30 error:0
> pbs_mom[2620] trap divide error rip:3c08f088a8 rsp:7fffffbef3a0 error:0
> momctl[2774] trap divide error rip:3c08f088a8 rsp:7fffff9a9a60 error:0
> qstat[2775] trap divide error rip:3c08f088a8 rsp:7fffffe14df0 error:0
> Any ideas on what exactly is going wrong? This had been running fine until yesterday, and there have been no changes to the system in the past couple of weeks.
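
One thing the dmesg excerpt above already shows: every fault has the same rip (3c08f088a8), even though the lines come from three different programs (pbs_mom, momctl, qstat). That suggests the divide error is in code they all load, such as a shared library or the dynamic loader, rather than in pbs_mom itself. This is an inference from the log, not a confirmed diagnosis. A minimal sketch for tallying such lines by rip on a node follows; the function name tally_divide_traps is made up for illustration, and it assumes the kernel message format shown in the excerpt.

```shell
# Tally kernel "trap divide error" lines by faulting instruction pointer (rip).
# Reads log text on stdin and prints "<rip> <count>" per distinct address.
# tally_divide_traps is a hypothetical helper name, not part of TORQUE.
tally_divide_traps() {
    awk '
        /trap divide error/ {
            # Find the field that starts with "rip:" and count its value.
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^rip:/) {
                    count[substr($i, 5)]++
                }
            }
        }
        END {
            for (rip in count) {
                print rip, count[rip]
            }
        }
    '
}
```

Usage would be something like "dmesg | tally_divide_traps". A single address in the output, across several program names, points at shared code.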

Can you get a gdb backtrace?

$ gdb qstat
(gdb) run
(gdb) bt
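
If it is easier to collect this from many nodes, the same session can be run non-interactively. This is only a sketch: it assumes a gdb that supports the -batch and -ex options (an older gdb may need a command file passed with -x instead), and the capture_bt name and the GDB override variable are made up for illustration.

```shell
# Run a program under gdb without a prompt and print a backtrace when it stops.
# Equivalent to typing "run" and then "bt" at the (gdb) prompt.
# capture_bt and the GDB variable are hypothetical conveniences.
capture_bt() {
    "${GDB:-gdb}" -batch -ex run -ex bt "$1"
}
```

For example, "capture_bt qstat > qstat-bt.txt 2>&1" would save the output to a file that can be pasted into a reply.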
