[torquedev] RE: pbs_mom suddenly throws floating-point exception on execute

Moody, Tristan tmoody at ku.edu
Fri Sep 14 14:01:13 MDT 2007


I did recompile with -g, and the backtrace is still useless:

#0  0x0000003c08f088a8 in ?? ()
#1  0x0000000000000000 in ?? ()

I should point out that the head node (where I compile) is running kernel 2.6.22.4-65.fc7, while the compute nodes are running 2.6.11-1.1369_FC4smp.  I'm not sure this is the cause, since TORQUE worked for about two months and then suddenly stopped.  Recompiling and reinstalling the software does nothing to fix the problem.

Tristan

-----Original Message-----
From: torquedev-bounces at supercluster.org on behalf of torquedev-request at supercluster.org
Sent: Thu 9/13/2007 1:00 PM
To: torquedev at supercluster.org
Subject: torquedev Digest, Vol 22, Issue 7
 

Today's Topics:

   1. RE: pbs_mom suddenly throws floating-point exception on
      execute (Moody, Tristan)
   2. Re: RE: pbs_mom suddenly throws floating-point exception on
      execute (Garrick Staples)


----------------------------------------------------------------------

Message: 1
Date: Wed, 12 Sep 2007 16:23:36 -0500
From: "Moody, Tristan" <tmoody at ku.edu>
Subject: [torquedev] RE: pbs_mom suddenly throws floating-point
	exception	on	execute
To: <torquedev at supercluster.org>
Message-ID:
	<515CE52A7472B549AFE9F121A619F9889408A8 at MAILBOXTHREE.home.ku.edu>
Content-Type: text/plain; charset="iso-8859-1"

This seems unlikely to me, as this has apparently happened to some thirty different machines in a very short timeframe.  Recompiling and reinstalling does not help either.

Tristan




Message: 2
Date: Tue, 11 Sep 2007 21:15:38 +1000
From: Chris Samuel <csamuel at vpac.org>
Subject: Re: [torquedev] pbs_mom suddenly throws floating-point
	exception on	execute
To: torquedev at supercluster.org
Message-ID: <200709112115.40787.csamuel at vpac.org>
Content-Type: text/plain;  charset="iso-8859-1"

On Tuesday 11 September 2007 06:15:03 Moody, Tristan wrote:

> Any ideas on what exactly is going wrong?  This had been running fine until
> last Thursday, and there have been no changes to the system since July.
>  yum.log and the up2date logs are both empty.  It seems odd that the
> software would just suddenly stop working.  Is there anything I'm missing?

My bet would be that some form of filesystem corruption has broken the pbs_mom
binary.  Is it installed from an RPM?  If so, you should be able to
use rpm -V $packagename to verify its MD5 checksums.

Otherwise, check the MD5 of the installed binary against the MD5 of the binary in
the source tree that you compiled.
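
A minimal sketch of the checksum comparison suggested above, assuming Python is available on the node. The function names are made up for illustration, and any paths passed in are placeholders rather than the actual install or build locations. Note that if the install step strips the binary, or the build tree holds a wrapper script instead of the real executable, the digests will differ even without any corruption.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a large binary is never held in memory whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def binaries_match(installed_path, built_path):
    """Return True when both files have the same MD5 digest."""
    return md5_of_file(installed_path) == md5_of_file(built_path)
```

For example, binaries_match("/path/to/installed/pbs_mom", "/path/to/build/pbs_mom") returns False when the two files differ; both paths here are hypothetical and need to be replaced with the real ones. Running the same check on several compute nodes would show whether they all carry an identical binary.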

Good luck!
Chris (on leave in the UK, so random access to email at the moment)
-- 
 Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia



------------------------------

_______________________________________________
torquedev mailing list
torquedev at supercluster.org
http://www.supercluster.org/mailman/listinfo/torquedev


End of torquedev Digest, Vol 22, Issue 5
****************************************

------------------------------

Message: 2
Date: Wed, 12 Sep 2007 14:20:12 -0700
From: Garrick Staples <garrick at usc.edu>
Subject: Re: [torquedev] RE: pbs_mom suddenly throws floating-point
	exception	on execute
To: torquedev at supercluster.org
Message-ID: <20070912212012.GY19043 at polop.usc.edu>
Content-Type: text/plain; charset="us-ascii"

On Wed, Sep 12, 2007 at 04:23:36PM -0500, Moody, Tristan alleged:
> This seems unlikely to me, as this has apparently happened to some thirty different machines in a very short timeframe.  Recompiling and reinstalling does not help either.

Did you recompile with -g so you could get a usable backtrace?

------------------------------


End of torquedev Digest, Vol 22, Issue 7
****************************************

