[torqueusers] pbs_sched crash

Alexander Saydakov saydakov at yahoo-inc.com
Fri Apr 28 10:46:08 MDT 2006


Wonderful. Thanks. I will give it a try.


-----Original Message-----
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Garrick Staples
Sent: Thursday, April 27, 2006 8:48 PM
To: torqueusers at supercluster.org
Subject: Re: [torqueusers] pbs_sched crash

On Wed, Mar 22, 2006 at 11:04:08AM -0800, Alexander Saydakov alleged:
> Last night pbs_sched crashed leaving our 70+ nodes idle all night long :(
> 
> #0  0x1013c8e in pbs_rescquery (c=0, resclist=0x9fbff484, num_resc=1,
>     available=0x9fbff498, allocated=0x9fbff494, reserved=0x9fbff490,
>     down=0x9fbff48c)
>     at ./../Libifl/pbsD_resc.c:218
> 218           *(available + i) = *(reply->brp_un.brp_rescq.brq_avail + i);

I just checked in this fix for 2.1.0; you can apply it to your 2.0.0 if
you want.  It might even help with the memory leak.

Index: src/lib/Libifl/pbsD_resc.c
===================================================================
RCS file: /usr/local/nfs/src/cvs_repository/torque/src/lib/Libifl/pbsD_resc.c,v
retrieving revision 1.3
diff -u -r1.3 pbsD_resc.c
--- src/lib/Libifl/pbsD_resc.c  23 Mar 2006 02:01:50 -0000      1.3
+++ src/lib/Libifl/pbsD_resc.c  28 Apr 2006 03:44:23 -0000
@@ -209,7 +209,7 @@
   
   reply = PBSD_rdrpy(c);

-  if (rc == PBSE_NONE)
+  if (((rc = connection[c].ch_errno) == PBSE_NONE))
     {
     /* copy in available and allocated numbers */
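
For anyone applying this by hand, here is a minimal sketch of what the
change appears to guard against. One plausible reading is that rc was left
over from an earlier step, so a failed reply read still reached the copy
at line 218. This is not the TORQUE source: every sketch_* name, the
SKETCH_EPROTO value, and the simplified structs are hypothetical stand-ins.
Only the ch_errno field name and the shape of the check come from the
patch.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_OK        0      /* plays the role of PBSE_NONE */
#define SKETCH_EPROTO    15031  /* made-up nonzero error code */
#define SKETCH_MAX_CONN  4

/* Hypothetical, simplified reply: only the "available" counts. */
struct sketch_reply {
    int *avail;
};

/* Hypothetical connection slot; ch_errno mirrors the field in the patch. */
struct sketch_conn {
    int ch_errno;                 /* last error seen on this connection */
    struct sketch_reply *pending; /* reply to deliver, or NULL on failure */
};

static struct sketch_conn sketch_connection[SKETCH_MAX_CONN];

/* Stand-in for the reply read: on failure it returns NULL and records
 * the error on the connection instead of returning a status code. */
static struct sketch_reply *sketch_read_reply(int c)
{
    struct sketch_reply *reply = sketch_connection[c].pending;

    sketch_connection[c].ch_errno =
        (reply == NULL) ? SKETCH_EPROTO : SKETCH_OK;
    return reply;
}

/* Copies the available counts only if the reply read succeeded. */
static int sketch_rescquery(int c, int num_resc, int *available)
{
    struct sketch_reply *reply;
    int rc;
    int i;

    assert(c >= 0 && c < SKETCH_MAX_CONN);

    reply = sketch_read_reply(c);

    /* Refresh rc from the connection after the read, as the patch does,
     * so a failed read never reaches the dereference below. */
    if ((rc = sketch_connection[c].ch_errno) == SKETCH_OK) {
        for (i = 0; i < num_resc; i++)
            *(available + i) = *(reply->avail + i);
    }

    return rc;
}
```

With a NULL pending reply, sketch_rescquery returns the error code and
leaves the caller's array untouched instead of dereferencing the reply.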



-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California


