[torqueusers] Floating Point exception for pbs_mom
garrick at usc.edu
Sun Sep 9 10:06:59 MDT 2007
On Sat, Sep 08, 2007 at 10:15:09PM -0500, Moody, Tristan alleged:
> After weeks of running with no problems, I discovered yesterday that about half of the nodes on the cluster I operate were listed as "down." After ssh-ing directly into the nodes to investigate, I found that any torque-related software I tried to run on those nodes returned a floating-point exception. Today I discovered that this problem is happening on ALL of the nodes--I cannot get pbs_mom started on any of the compute nodes. The compute nodes are running kernel 2.6.11-1.1369_FC4smp x86_64 and torque version 2.1.8.
> Here's an excerpt from dmesg on one of the nodes:
> pbs_mom trap divide error rip:3c08f088a8 rsp:7fffffe60f10 error:0
> pbs_mom trap divide error rip:3c08f088a8 rsp:7fffffe06d30 error:0
> pbs_mom trap divide error rip:3c08f088a8 rsp:7fffffbef3a0 error:0
> momctl trap divide error rip:3c08f088a8 rsp:7fffff9a9a60 error:0
> qstat trap divide error rip:3c08f088a8 rsp:7fffffe14df0 error:0
> Any ideas on what exactly is going wrong? This had been running fine until yesterday, and there have been no changes to the system in the past couple weeks.
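One thing the dmesg excerpt already hints at: pbs_mom, momctl and qstat all trap at the same rip, which suggests (but does not prove) that the divide error is in code they share, such as a common shared library, rather than in pbs_mom itself. A minimal sketch for checking this on other nodes, assuming the log lines look like the ones quoted above; summarize_traps is a hypothetical helper name, not part of torque:

```shell
# Group "trap divide error" kernel log lines by faulting instruction
# pointer and list the programs that faulted at each address.
summarize_traps() {
    awk '/trap divide error/ {
        prog = ""; rip = ""
        for (i = 1; i <= NF; i++) {
            if ($i == "trap" && i > 1) prog = $(i - 1)
            if ($i ~ /^rip:/) rip = substr($i, 5)
        }
        if (prog != "" && rip != "" && !seen[rip, prog]++)
            progs[rip] = progs[rip] " " prog
    }
    END { for (r in progs) print r ":" progs[r] }'
}

# Usage on a compute node:
#   dmesg | summarize_traps
```

A single address shared by several different programs points toward common code; several different addresses would point elsewhere.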
Can you get a gdb backtrace?
$ gdb qstat
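If an interactive session is awkward on the compute nodes, the same backtrace can probably be captured in batch mode. This is only a sketch: it assumes gdb is installed on the node and that the program crashes on startup as described, and gdb_backtrace is a hypothetical helper name. Function names will only appear in the output if symbols are available.

```shell
# Run a program under gdb non-interactively and print a backtrace
# once it stops (for example on SIGFPE).
gdb_backtrace() {
    cmds=$(mktemp /tmp/gdbcmds.XXXXXX) || return 1
    # gdb commands: start the program, then dump the stack when it stops.
    printf 'run\nbt\n' > "$cmds"
    # -batch makes gdb exit after the command file has been processed.
    gdb -batch -x "$cmds" "$1" 2>&1
    status=$?
    rm -f "$cmds"
    return $status
}

# Usage:
#   gdb_backtrace qstat > qstat-backtrace.txt
```

The resulting text file can be pasted into a reply to the list.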