[torquedev] pbs_mom suddenly throws floating-point exception on execute

Moody, Tristan tmoody at ku.edu
Mon Sep 10 14:15:03 MDT 2007



(cross-posted in torque-users as well)

After weeks of running with no problems, I discovered last Thursday that about half of the nodes on the cluster I operate were listed as "down."  After ssh-ing directly into the nodes to investigate, I found that any TORQUE-related program I tried to run on those nodes died with a floating-point exception.  On Friday I discovered that the problem now affects ALL of the nodes: I cannot get pbs_mom started on any of the compute nodes.  The compute nodes are running kernel 2.6.11-1.1369_FC4smp (x86_64) and TORQUE version 2.1.8.

Here's an excerpt from dmesg on one of the nodes:

pbs_mom[1903] trap divide error rip:3c08f088a8 rsp:7fffffe60f10 error:0
pbs_mom[2408] trap divide error rip:3c08f088a8 rsp:7fffffe06d30 error:0
pbs_mom[2620] trap divide error rip:3c08f088a8 rsp:7fffffbef3a0 error:0
momctl[2774] trap divide error rip:3c08f088a8 rsp:7fffff9a9a60 error:0
qstat[2775] trap divide error rip:3c08f088a8 rsp:7fffffe14df0 error:0
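
To make the pattern in that excerpt easier to see, here is a minimal sketch (assuming Python 3 is available somewhere to run it) that groups the trap lines by faulting instruction address. The regular expression is only an assumption about the line format shown above, and the names `TRAP_RE` and `group_by_rip` are made up for illustration.

```python
import re
from collections import defaultdict

# Lines copied verbatim from the dmesg excerpt above.
DMESG = """\
pbs_mom[1903] trap divide error rip:3c08f088a8 rsp:7fffffe60f10 error:0
pbs_mom[2408] trap divide error rip:3c08f088a8 rsp:7fffffe06d30 error:0
pbs_mom[2620] trap divide error rip:3c08f088a8 rsp:7fffffbef3a0 error:0
momctl[2774] trap divide error rip:3c08f088a8 rsp:7fffff9a9a60 error:0
qstat[2775] trap divide error rip:3c08f088a8 rsp:7fffffe14df0 error:0
"""

# Assumed line layout: prog[pid] trap divide error rip:<hex> rsp:<hex> ...
TRAP_RE = re.compile(
    r"^(?P<prog>\S+)\[(?P<pid>\d+)\] trap divide error "
    r"rip:(?P<rip>[0-9a-f]+) rsp:(?P<rsp>[0-9a-f]+)"
)


def group_by_rip(text):
    """Map each faulting instruction address to the set of programs that trapped there."""
    groups = defaultdict(set)
    for line in text.splitlines():
        match = TRAP_RE.match(line.strip())
        if match:
            groups[match.group("rip")].add(match.group("prog"))
    return dict(groups)


if __name__ == "__main__":
    for rip, progs in group_by_rip(DMESG).items():
        print(f"rip {rip}: {sorted(progs)}")
    # prints: rip 3c08f088a8: ['momctl', 'pbs_mom', 'qstat']
```

All five traps land on the same address, 3c08f088a8, in three different programs. That would be consistent with the fault sitting in code the three binaries share rather than in each program separately, though the excerpt alone does not prove it.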

A gdb backtrace gives the following:

(gdb) run
Starting program: /usr/local/bin/qstat

Program received signal SIGFPE, Arithmetic exception.
0x0000003c08f088a8 in ?? ()
(gdb) bt
#0  0x0000003c08f088a8 in ?? ()
#1  0x00007fffffbf3a20 in ?? ()
#2  0x00007fffffbf38e0 in ?? ()
#3  0x00007fffffbf38d0 in ?? ()
#4  0x0000003c0910b858 in ?? ()
#5  0x00000000000668c3 in ?? ()
#6  0x0000003c09111dc0 in ?? ()
#7  0x0000000000000000 in ?? ()
(gdb)
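
Since every frame shows up as ??, the backtrace does not say which file the faulting address belongs to. One way to narrow that down would be to look the address up in the process's memory map (the /proc/<pid>/maps format on Linux). The sketch below only illustrates that lookup: `find_mapping` is a hypothetical helper, and `SAMPLE_MAPS` is invented sample data with made-up ranges and paths, NOT output from the affected nodes.

```python
def find_mapping(maps_text, addr):
    """Return (start, end, pathname) of the mapping containing addr, or None.

    maps_text is expected in /proc/<pid>/maps layout:
    start-end perms offset dev inode [pathname]
    """
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # not a mapping line
        start_s, end_s = fields[0].split("-")
        start, end = int(start_s, 16), int(end_s, 16)
        if start <= addr < end:
            path = fields[5].strip() if len(fields) > 5 else "[anonymous]"
            return start, end, path
    return None


# HYPOTHETICAL sample data for illustration only; ranges and paths are invented.
SAMPLE_MAPS = """\
00400000-00410000 r-xp 00000000 08:01 1234 /hypothetical/bin/example
3c08f00000-3c08f1a000 r-xp 00000000 08:01 5678 /hypothetical/lib64/libexample.so
7fffffbe0000-7fffffc00000 rw-p 00000000 00:00 0 [stack]
"""

if __name__ == "__main__":
    # Frame #0 address from the backtrace above.
    print(find_mapping(SAMPLE_MAPS, 0x3C08F088A8))
```

On a real node the input would be the contents of /proc/<pid>/maps for the stopped process, read while gdb still has it at the SIGFPE.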


Any ideas on what exactly is going wrong?  This had been running fine until last Thursday, and there have been no changes to the system since July.  yum.log and the up2date logs are both empty.  It seems odd that the software would just suddenly stop working.  Is there anything I'm missing?


Tristan Moody

