[torqueusers] pbs_sched crash

Alexander Saydakov saydakov at yahoo-inc.com
Thu Mar 23 11:19:48 MST 2006


BTW, pbs_sched seems to have a memory leak. When I restarted the thing
yesterday its footprint was about 50M. Now, slightly more than a day later,
it is 225M already. It could be unrelated to the crash in question though.


-----Original Message-----
From: Alexander Saydakov [mailto:saydakov at yahoo-inc.com] 
Sent: Wednesday, March 22, 2006 3:25 PM
To: 'Garrick Staples'; 'torqueusers at supercluster.org'
Subject: RE: [torqueusers] pbs_sched crash

Here are log entries around the core dump time (00:23:07):

03/21/2006 00:23:06;0100;PBS_Server;Req;;Type ResourceQuery request received from Scheduler at smag1.data.yahoo.com, sock=17
03/21/2006 00:23:07;0100;PBS_Server;Job;639286.smag1.data.yahoo.com;dequeuing from queue1, state COMPLETE
03/21/2006 00:23:12;0001;PBS_Server;Svr;PBS_Server;Operation now in progress (36) in contact_sched, Could not contact Scheduler - port 15004


-----Original Message-----
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Garrick Staples
Sent: Wednesday, March 22, 2006 11:29 AM
To: torqueusers at supercluster.org
Subject: Re: [torqueusers] pbs_sched crash

On Wed, Mar 22, 2006 at 11:04:08AM -0800, Alexander Saydakov alleged:
> #0  0x1013c8e in pbs_rescquery (c=0, resclist=0x9fbff484, num_resc=1, available=0x9fbff498, allocated=0x9fbff494, reserved=0x9fbff490, down=0x9fbff48c)
>     at ./../Libifl/pbsD_resc.c:218
> 218           *(available + i) = *(reply->brp_un.brp_rescq.brq_avail + i);

Can you check your server logs?  I bet pbs_server was hung on something
causing a timeout in the scheduler's pbs_rescquery() call.


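That code looks wrong to me.  rc has already been checked just above, so
'if (rc == PBSE_NONE)' is always true at that point and says nothing about
whether PBSD_rdrpy() succeeded.  I think it should be 'if (pbs_errno ==
PBSE_NONE)'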

  if ((rc = PBS_resc(c,PBS_BATCH_Rescq,resclist,num_resc,(resource_t)0)) != 0)
    {
    return(rc);
    }

  /* read in reply */

  reply = PBSD_rdrpy(c);

  if (rc == PBSE_NONE)
    {
    /* copy in available and allocated numbers */

    for (i = 0;i < num_resc;++i)
      {
      *(available + i) = *(reply->brp_un.brp_rescq.brq_avail + i);
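
For illustration only, here is a minimal standalone sketch of the guard
suggested above.  It is not the TORQUE source: every sketch_* and SKETCH_*
name is a hypothetical stand-in (sketch_read_reply() for PBSD_rdrpy(),
sketch_errno for pbs_errno, the avail member for brq_avail), and it assumes
that a failed read returns NULL and sets the error variable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real error codes. */
#define SKETCH_ERR_NONE     0
#define SKETCH_ERR_PROTOCOL 1

/* Hypothetical, cut-down reply; the real struct has many more members. */
struct sketch_reply {
    int *avail;                 /* plays the role of brq_avail */
};

static int sketch_errno = SKETCH_ERR_NONE;   /* plays the role of pbs_errno */

/* Stand-in for reading the server reply.  When 'fail' is nonzero it
 * simulates a timeout: it sets the error variable and returns NULL. */
static struct sketch_reply *sketch_read_reply(int fail, int *data)
{
    static struct sketch_reply reply;

    if (fail) {
        sketch_errno = SKETCH_ERR_PROTOCOL;
        return NULL;
    }
    sketch_errno = SKETCH_ERR_NONE;
    reply.avail = data;
    return &reply;
}

/* Copy the counts only when the reply was actually read. */
int sketch_rescquery(int fail, int *data, int num_resc, int *available)
{
    struct sketch_reply *reply;
    int i;

    assert(available != NULL);

    reply = sketch_read_reply(fail, data);

    /* Test the read itself, not the return code of the earlier send. */
    if (reply == NULL || sketch_errno != SKETCH_ERR_NONE)
        return (sketch_errno != SKETCH_ERR_NONE) ? sketch_errno
                                                 : SKETCH_ERR_PROTOCOL;

    for (i = 0; i < num_resc; ++i)
        *(available + i) = *(reply->avail + i);

    return SKETCH_ERR_NONE;
}
```

With fail set to 0 the caller gets SKETCH_ERR_NONE and the copied counts;
with fail set to 1 it gets an error code back and 'available' is left
untouched, instead of the loop dereferencing a NULL reply.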

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California


