[torqueusers] Floating Point exception for pbs_mom

Moody, Tristan tmoody at ku.edu
Sat Sep 8 21:15:09 MDT 2007

After weeks of running with no problems, I discovered yesterday that about half of the nodes on the cluster I operate were listed as "down."  After ssh-ing directly into the nodes to investigate, I found that any torque-related software I tried to run on them failed with a floating-point exception.  Today I discovered that the problem is happening on ALL of the nodes: I cannot get pbs_mom started on any of the compute nodes.  The compute nodes are running kernel 2.6.11-1.1369_FC4smp (x86_64) and torque version 2.1.8.

Here's an excerpt from dmesg on one of the nodes:

pbs_mom[1903] trap divide error rip:3c08f088a8 rsp:7fffffe60f10 error:0
pbs_mom[2408] trap divide error rip:3c08f088a8 rsp:7fffffe06d30 error:0
pbs_mom[2620] trap divide error rip:3c08f088a8 rsp:7fffffbef3a0 error:0
momctl[2774] trap divide error rip:3c08f088a8 rsp:7fffff9a9a60 error:0
qstat[2775] trap divide error rip:3c08f088a8 rsp:7fffffe14df0 error:0

Any ideas on what exactly is going wrong?  Everything had been running fine until yesterday, and there have been no changes to the system in the past couple of weeks.
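
One thing that stands out in the dmesg excerpt is that every trap reports the same rip value, even though three different programs (pbs_mom, momctl, qstat) are involved.  That would be consistent with the divide error occurring in code that all three share, such as a common shared library, but that is only a guess and not something I have confirmed.  Below is a minimal Python sketch that groups the trap lines by rip to make the pattern easy to check on other nodes.  The function name group_by_rip and the regular expression are my own illustration, written against the line format shown above; they are not part of torque or of any kernel tooling.

```python
import re
from collections import defaultdict

# Matches kernel log lines of the form shown in the excerpt, e.g.:
#   pbs_mom[1903] trap divide error rip:3c08f088a8 rsp:7fffffe60f10 error:0
# The pattern is an assumption based only on those five lines.
TRAP_RE = re.compile(
    r"^(?P<prog>\S+)\[(?P<pid>\d+)\] trap divide error "
    r"rip:(?P<rip>[0-9a-f]+) rsp:(?P<rsp>[0-9a-f]+) error:(?P<err>\d+)$"
)


def group_by_rip(lines):
    """Map each faulting instruction pointer to the sorted list of
    program names that trapped there. Non-matching lines are ignored."""
    groups = defaultdict(set)
    for line in lines:
        match = TRAP_RE.match(line.strip())
        if match:
            groups[match.group("rip")].add(match.group("prog"))
    return {rip: sorted(progs) for rip, progs in groups.items()}


# The dmesg excerpt from this message, copied verbatim.
DMESG_EXCERPT = """\
pbs_mom[1903] trap divide error rip:3c08f088a8 rsp:7fffffe60f10 error:0
pbs_mom[2408] trap divide error rip:3c08f088a8 rsp:7fffffe06d30 error:0
pbs_mom[2620] trap divide error rip:3c08f088a8 rsp:7fffffbef3a0 error:0
momctl[2774] trap divide error rip:3c08f088a8 rsp:7fffff9a9a60 error:0
qstat[2775] trap divide error rip:3c08f088a8 rsp:7fffffe14df0 error:0
"""

if __name__ == "__main__":
    for rip, progs in group_by_rip(DMESG_EXCERPT.splitlines()).items():
        print(rip, ", ".join(progs))
```

To run it against a live node, the output of dmesg could be piped in and passed to group_by_rip in place of DMESG_EXCERPT.  A single rip shared by all programs points at one common fault location; several distinct values would suggest something else is going on.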

