[torqueusers] pbs_sched crash
Alexander Saydakov
saydakov at yahoo-inc.com
Thu Mar 23 11:19:48 MST 2006
BTW, pbs_sched seems to have a memory leak. When I restarted the thing
yesterday its footprint was about 50M. Now, slightly more than a day later,
it is 225M already. It could be unrelated to the crash in question though.
-----Original Message-----
From: Alexander Saydakov [mailto:saydakov at yahoo-inc.com]
Sent: Wednesday, March 22, 2006 3:25 PM
To: 'Garrick Staples'; 'torqueusers at supercluster.org'
Subject: RE: [torqueusers] pbs_sched crash
Here are log entries around the core dump time (23:07):
03/21/2006 00:23:06;0100;PBS_Server;Req;;Type ResourceQuery request received from Scheduler at smag1.data.yahoo.com, sock=17
03/21/2006 00:23:07;0100;PBS_Server;Job;639286.smag1.data.yahoo.com;dequeuing from queue1, state COMPLETE
03/21/2006 00:23:12;0001;PBS_Server;Svr;PBS_Server;Operation now in progress (36) in contact_sched, Could not contact Scheduler - port 15004
-----Original Message-----
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Garrick Staples
Sent: Wednesday, March 22, 2006 11:29 AM
To: torqueusers at supercluster.org
Subject: Re: [torqueusers] pbs_sched crash
On Wed, Mar 22, 2006 at 11:04:08AM -0800, Alexander Saydakov alleged:
> #0 0x1013c8e in pbs_rescquery (c=0, resclist=0x9fbff484, num_resc=1,
>     available=0x9fbff498, allocated=0x9fbff494, reserved=0x9fbff490,
>     down=0x9fbff48c)
>     at ./../Libifl/pbsD_resc.c:218
> 218 *(available + i) = *(reply->brp_un.brp_rescq.brq_avail + i);
Can you check your server logs? I bet pbs_server was hung on something
causing a timeout in the scheduler's pbs_rescquery() call.
That code looks wrong to me. I think it should be
'if (pbs_errno == PBSE_NONE)'
  if ((rc = PBS_resc(c,PBS_BATCH_Rescq,resclist,num_resc,(resource_t)0)) != 0)
    {
    return(rc);
    }

  /* read in reply */

  reply = PBSD_rdrpy(c);

  if (rc == PBSE_NONE)
    {
    /* copy in available and allocated numbers */

    for (i = 0;i < num_resc;++i)
      {
      *(available + i) = *(reply->brp_un.brp_rescq.brq_avail + i);
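
To illustrate the suspected failure mode: in the snippet above, rc has already been tested by the early return, so (assuming PBSE_NONE is 0) the later 'rc == PBSE_NONE' test is always true and says nothing about whether the reply was actually read. The following is only a minimal, self-contained sketch of the guarded pattern, not the TORQUE code. Every name in it (fake_reply, fake_read_reply, fake_query, last_error, ERR_NONE, ERR_TIMEOUT) is a hypothetical stand-in, and it assumes that a timed-out read hands back a NULL reply and records an error, which the thread suggests but does not confirm.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real TORQUE types or calls. */
#define ERR_NONE    0
#define ERR_TIMEOUT 1
#define MAX_RESC    4

struct fake_reply {
    int avail[MAX_RESC];        /* per-resource "available" counts */
};

int last_error = ERR_NONE;      /* plays the role of pbs_errno (assumption) */

/* Pretend to read a reply. On a simulated timeout, record the error and
 * return NULL, which is the situation the sketch is meant to survive. */
struct fake_reply *fake_read_reply(int simulate_timeout)
{
    static struct fake_reply ok = { { 8, 4, 2, 1 } };

    if (simulate_timeout) {
        last_error = ERR_TIMEOUT;
        return NULL;
    }

    last_error = ERR_NONE;
    return &ok;
}

/* Copy num counts into out. Returns 0 on success, a nonzero code otherwise.
 * out is left untouched when the read fails. */
int fake_query(int simulate_timeout, int *out, int num)
{
    struct fake_reply *reply;
    int i;

    assert(out != NULL);
    assert(num >= 0 && num <= MAX_RESC);

    reply = fake_read_reply(simulate_timeout);

    /* Test the error recorded by the read itself, and the pointer,
     * before dereferencing the reply. */
    if (last_error != ERR_NONE || reply == NULL)
        return (last_error != ERR_NONE) ? last_error : -1;

    for (i = 0; i < num; ++i)
        out[i] = reply->avail[i];

    return 0;
}
```

Calling fake_query(1, out, 2) returns ERR_TIMEOUT without touching out, while fake_query(0, out, 2) fills in the first two counts. Checking the pointer as well as the error variable is a belt-and-braces choice for the sketch; whether the real library needs both is not something this thread establishes.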
--
Garrick Staples, Linux/HPCC Administrator
University of Southern California