[torquedev] RE: pbs_mom suddenly throws floating-point exception on execute

Moody, Tristan tmoody at ku.edu
Fri Sep 14 14:41:45 MDT 2007


The binaries are hosted locally on each of the 30 compute nodes.  To add to the confusion, we have another cluster with 14 compute nodes on which torque works perfectly.

Even compiling with -g -O0 gives me a useless backtrace, though it seems that every single torque program crashes at the same address: 0x0000003c08f088a8.  I may have to revert to an older version to see if that will work.

Tristan


----------------Original Message Follows:--------------------

On Thursday 13 September 2007 07:23:36 Moody, Tristan wrote:

> This seems unlikely to me, as this has apparently happened to some thirty
> different machines in a very short timeframe.

But if the binary is served out from an NFS server (which is normal practice 
here at VPAC), corruption of that one copy on the server could affect every 
client running it.  I've seen cases where modifying a binary on an NFS server 
(SLES9) killed the running binaries on the clients over a short period of time. :-(

> Recompiling and reinstalling does not help either.

That would appear to rule that out then.. :-(

Did you have any luck compiling it with "-g -O0" to get some decent 
debugging output from gdb?

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency

