[torqueusers] Floating Point exception for pbs_mom

Moody, Tristan tmoody at ku.edu
Sat Sep 8 21:15:09 MDT 2007


 
After weeks of running with no problems, I discovered yesterday that about half of the nodes on the cluster I operate were listed as "down."  After ssh-ing directly into the nodes to investigate, I found that every torque-related program I tried to run on those nodes died with a floating-point exception.  Today I discovered that the problem now affects ALL of the nodes: I cannot get pbs_mom started on any of the compute nodes.  The compute nodes are running kernel 2.6.11-1.1369_FC4smp (x86_64) and torque version 2.1.8.

Here's an excerpt from dmesg on one of the nodes:

pbs_mom[1903] trap divide error rip:3c08f088a8 rsp:7fffffe60f10 error:0
pbs_mom[2408] trap divide error rip:3c08f088a8 rsp:7fffffe06d30 error:0
pbs_mom[2620] trap divide error rip:3c08f088a8 rsp:7fffffbef3a0 error:0
momctl[2774] trap divide error rip:3c08f088a8 rsp:7fffff9a9a60 error:0
qstat[2775] trap divide error rip:3c08f088a8 rsp:7fffffe14df0 error:0
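
Note that pbs_mom, momctl and qstat all fault at the same rip.  To check this over a longer dmesg log, a minimal Python sketch along the following lines could group the trap lines by instruction pointer.  This is not part of torque; the names TRAP_RE, group_by_rip and SAMPLE are made up for illustration, and the regular expression assumes the exact line format shown in the excerpt above.

```python
import re
from collections import defaultdict

# Assumed line format, taken from the dmesg excerpt above:
#   pbs_mom[1903] trap divide error rip:3c08f088a8 rsp:7fffffe60f10 error:0
TRAP_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\] trap divide error "
    r"rip:(?P<rip>[0-9a-f]+) rsp:(?P<rsp>[0-9a-f]+) error:(?P<err>\d+)"
)


def group_by_rip(dmesg_text):
    """Map each faulting rip to the sorted list of program names seen there."""
    groups = defaultdict(set)
    for line in dmesg_text.splitlines():
        match = TRAP_RE.search(line)
        if match:
            groups[match.group("rip")].add(match.group("prog"))
    return {rip: sorted(progs) for rip, progs in groups.items()}


# The five lines from the excerpt, used here as sample input.
SAMPLE = """\
pbs_mom[1903] trap divide error rip:3c08f088a8 rsp:7fffffe60f10 error:0
pbs_mom[2408] trap divide error rip:3c08f088a8 rsp:7fffffe06d30 error:0
pbs_mom[2620] trap divide error rip:3c08f088a8 rsp:7fffffbef3a0 error:0
momctl[2774] trap divide error rip:3c08f088a8 rsp:7fffff9a9a60 error:0
qstat[2775] trap divide error rip:3c08f088a8 rsp:7fffffe14df0 error:0
"""

if __name__ == "__main__":
    for rip, progs in group_by_rip(SAMPLE).items():
        print(rip, ", ".join(progs))
```

On a node, the same function could be fed the saved output of dmesg instead of SAMPLE.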


Any ideas on what exactly is going wrong?  This had been running fine until yesterday, and there have been no changes to the system in the past couple of weeks.


Tristan Moody





