[torqueusers] pbs_mom crashing
Glen Beane
glen.beane at gmail.com
Tue Jun 16 13:56:04 MDT 2009
We made your errno change, and if you remember, the close_conn
change you made doesn't appear to do anything. You changed from
close() to close_conn() inside the else block, where
svr_conn[i].cn_active is guaranteed to be Idle, but close_conn just
returns if cn_active is Idle (copied and pasted from the thread):

The problem with changing it to close_conn() in this situation is
that close_conn WON'T DO ANYTHING because
svr_conn[i].cn_active == Idle when it is called!

if (svr_conn[i].cn_active != Idle)
{
  netcounter_incr();

  svr_conn[i].cn_func(i);

  /* NOTE: breakout if state changed (probably received
     shutdown request) */

  if ((SState != NULL) && (OrigState != *SState))
    break;
}
else /* XXXXXXX svr_conn[i].cn_active == Idle !!!! */
{
  close_conn(i);
}

void close_conn(

  int sd) /* I */

{
  if ((sd < 0) || (max_connection <= sd))
  {
    return;
  }

  if (svr_conn[sd].cn_active == Idle) /* XXXXXXXXXXXXXX LOOK AT ME!
                                         I'm not going to do anything! */
  {
    return;
  }

  close(sd);

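
To make the point concrete, here is a small standalone sketch that
reproduces the behaviour outside of TORQUE. It is not the real source:
the connection table is reduced to a single field, MAX_CONN and the
helper names fd_is_open(), idle_close_is_noop() and
plain_close_releases_fd() are made up for illustration, and a pipe
stands in for a socket. Only the two guards at the top of close_conn()
are modelled on the excerpt above.

```c
#include <fcntl.h>
#include <unistd.h>

#define MAX_CONN 64                     /* hypothetical table size */

enum conn_state { Idle = 0, Active };

struct connection {
    enum conn_state cn_active;          /* only field needed for this sketch */
};

static struct connection svr_conn[MAX_CONN];
static int max_connection = MAX_CONN;

/* Same two guards as the close_conn() excerpt above. */
static void close_conn(int sd)
{
    if ((sd < 0) || (max_connection <= sd))
        return;

    if (svr_conn[sd].cn_active == Idle)
        return;                         /* idle entry: descriptor stays open */

    close(sd);
    svr_conn[sd].cn_active = Idle;
}

/* Returns 1 if sd is still an open descriptor, 0 otherwise. */
static int fd_is_open(int sd)
{
    return fcntl(sd, F_GETFD) != -1;
}

/* Returns 1 if close_conn() left an Idle descriptor open,
   0 if it closed it, -1 if the test could not be set up. */
static int idle_close_is_noop(void)
{
    int fds[2];
    int still_open;

    if (pipe(fds) != 0)
        return -1;

    if (fds[0] >= MAX_CONN) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    svr_conn[fds[0]].cn_active = Idle;  /* the else-branch situation */
    close_conn(fds[0]);
    still_open = fd_is_open(fds[0]);

    if (still_open)
        close(fds[0]);                  /* clean up what close_conn() skipped */
    close(fds[1]);

    return still_open;
}

/* Returns 1 if a plain close() released the descriptor,
   0 if it is still open, -1 if the test could not be set up. */
static int plain_close_releases_fd(void)
{
    int fds[2];
    int released;

    if (pipe(fds) != 0)
        return -1;

    close(fds[0]);
    released = !fd_is_open(fds[0]);
    close(fds[1]);

    return released;
}
```

Calling idle_close_is_noop() shows the descriptor surviving the
close_conn() call, while plain_close_releases_fd() shows the same
descriptor going away with a direct close(). Whether a direct close()
is the right fix in the real loop is a separate question; the sketch
only shows that close_conn() cannot be it when the entry is Idle.
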
On Tue, Jun 16, 2009 at 2:51 PM, Joshua
Bernstein<jbernstein at penguincomputing.com> wrote:
> Hey Glen,
>
> You may recall I posted something about pbs_mom crashing back in
> December. Based on what the core dump looked like, I did implement a
> fix. Even if it is not the correct one, it did make the problem go away:
>
> http://www.clusterresources.com/pipermail/torquedev/2008-December/001276.html
>
> -Joshua Bernstein
> Senior Software Engineer
> Penguin Computing
>
> Glen Beane wrote:
>>
>> in the past few months we have upgraded from TORQUE 2.1.X to 2.3.6 (in
>> order to use the latest Moab 5.3.x, which we needed for a specific
>> feature). Ever since the upgrade we've had pbs_moms just die. It
>> doesn't seem to be real regular, but something is definitely going on
>> with our system. Has anyone else seen this? I just had one croak
>> today. The uptime on the node is about 320 days, pbs_mom just seemed
>> to die randomly.
>>
>> Here is what the log file looks like up until pbs_mom went away:
>>
>>
>>
>> 06/16/2009 07:40:42;0002; pbs_mom;n/a;toolong;alarm call
>> 06/16/2009 07:42:31;0002; pbs_mom;n/a;toolong;alarm call
>> 06/16/2009 07:49:26;0002; pbs_mom;n/a;toolong;alarm call
>> 06/16/2009 07:50:17;0002; pbs_mom;n/a;toolong;alarm call
>> 06/16/2009 07:51:02;0002; pbs_mom;n/a;toolong;alarm call
>> 06/16/2009 07:51:56;0002; pbs_mom;n/a;toolong;alarm call
>> 06/16/2009 07:55:18;0002; pbs_mom;n/a;toolong;alarm call
>> 06/16/2009 07:56:43;0002; pbs_mom;n/a;toolong;alarm call
>> 06/16/2009 07:58:25;0002;
>> pbs_mom;n/a;mom_server_check_connection;connection to server wulfgar
>> timeout
>> 06/16/2009 07:58:25;0002;
>> pbs_mom;n/a;mom_server_check_connection;sending hello to server
>> wulfgar
>> 06/16/2009 07:58:46;0002; pbs_mom;n/a;toolong;alarm call
>> 06/16/2009 08:00:23;0002; pbs_mom;n/a;toolong;alarm call
>> 06/16/2009 08:01:41;0002; pbs_mom;n/a;toolong;alarm call
>> 06/16/2009 08:02:10;0002; pbs_mom;n/a;toolong;alarm call
>> 06/16/2009 08:15:26;0080;
>> pbs_mom;Job;42690.wulfgar.jax.org;scan_for_terminated: job
>> 42690.wulfgar.jax.org task 1 terminated, sid=18362
>> 06/16/2009 08:15:26;0008; pbs_mom;Job;42690.wulfgar.jax.org;job was
>> terminated
>> 06/16/2009 08:15:26;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
>> 06/16/2009 08:15:26;0080;
>> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked,
>> top of while loop
>> 06/16/2009 08:15:26;0080; pbs_mom;Svr;preobit_reply;in while loop,
>> no error from job stat
>> 06/16/2009 08:15:26;0008; pbs_mom;Job;42690.wulfgar.jax.org;checking
>> job post-processing routine
>> 06/16/2009 08:15:26;0080; pbs_mom;Job;42690.wulfgar.jax.org;obit
>> sent to server
>> 06/16/2009 08:15:26;0080;
>> pbs_mom;Job;42691.wulfgar.jax.org;scan_for_terminated: job
>> 42691.wulfgar.jax.org task 1 terminated, sid=18372
>> 06/16/2009 08:15:26;0008; pbs_mom;Job;42691.wulfgar.jax.org;job was
>> terminated
>> 06/16/2009 08:15:26;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
>> 06/16/2009 08:15:26;0080;
>> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked,
>> top of while loop
>> 06/16/2009 08:15:26;0080; pbs_mom;Svr;preobit_reply;in while loop,
>> no error from job stat
>> 06/16/2009 08:15:26;0008; pbs_mom;Job;42691.wulfgar.jax.org;checking
>> job post-processing routine
>> 06/16/2009 08:15:26;0080; pbs_mom;Job;42691.wulfgar.jax.org;obit
>> sent to server
>> 06/16/2009 08:15:26;0080;
>> pbs_mom;Job;42689.wulfgar.jax.org;scan_for_terminated: job
>> 42689.wulfgar.jax.org task 1 terminated, sid=18352
>> 06/16/2009 08:15:26;0008; pbs_mom;Job;42689.wulfgar.jax.org;job was
>> terminated
>> 06/16/2009 08:15:26;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
>> 06/16/2009 08:15:26;0080;
>> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked,
>> top of while loop
>> 06/16/2009 08:15:26;0080; pbs_mom;Svr;preobit_reply;in while loop,
>> no error from job stat
>> 06/16/2009 08:15:26;0008; pbs_mom;Job;42689.wulfgar.jax.org;checking
>> job post-processing routine
>> 06/16/2009 08:15:26;0080; pbs_mom;Job;42689.wulfgar.jax.org;obit
>> sent to server
>> 06/16/2009 08:15:33;0080;
>> pbs_mom;Job;42688.wulfgar.jax.org;scan_for_terminated: job
>> 42688.wulfgar.jax.org task 1 terminated, sid=18343
>> 06/16/2009 08:15:33;0008; pbs_mom;Job;42688.wulfgar.jax.org;job was
>> terminated
>> 06/16/2009 08:15:33;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
>> 06/16/2009 08:15:33;0080;
>> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked,
>> top of while loop
>> 06/16/2009 08:15:33;0080; pbs_mom;Svr;preobit_reply;in while loop,
>> no error from job stat
>> 06/16/2009 08:15:33;0008; pbs_mom;Job;42688.wulfgar.jax.org;checking
>> job post-processing routine
>> 06/16/2009 08:15:33;0080; pbs_mom;Job;42688.wulfgar.jax.org;obit
>> sent to server
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>