[torqueusers] pbs_mom crashing

Glen Beane glen.beane at gmail.com
Tue Jun 16 13:56:04 MDT 2009


We made your errno change, and if you remember, the close_conn
change you made doesn't appear to do anything.  You changed from
close() to close_conn() inside the else block, where
svr_conn[i].cn_active is guaranteed to be Idle, but close_conn just
returns if cn_active is Idle (copied and pasted from the thread):


The problem with changing it to
close_conn() in this situation is that close_conn WON'T DO ANYTHING
because svr_conn[i].cn_active == Idle when it is called!

    if (svr_conn[i].cn_active != Idle)
        {
        netcounter_incr();

        svr_conn[i].cn_func(i);

        /* NOTE: breakout if state changed (probably received shutdown request) */

        if ((SState != NULL) && (OrigState != *SState))
          break;
        }
      else  /* XXXXXXX svr_conn[i].cn_active == Idle !!!! */
        {
        close_conn(i);
        }



void close_conn(

  int sd) /* I */

  {
  if ((sd < 0) || (max_connection <= sd))
    {
    return;
    }

  if (svr_conn[sd].cn_active == Idle)    /* XXXXXXXXXXXXXX LOOK AT ME! I'm not going to do anything! */
    {
    return;
    }

  close(sd);

  /* [rest of close_conn snipped] */



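To make it concrete, here is a tiny standalone sketch (not TORQUE source;
close_conn_like, MAX_CONN, and Busy are made-up names for illustration only)
showing why a guard of this shape cannot close anything once the slot is
already marked Idle, which is exactly the state the else branch runs in:

#include <stdio.h>

enum conn_state { Idle, Busy };               /* stand-ins for the real connection states */

#define MAX_CONN 4

static enum conn_state cn_active[MAX_CONN];   /* plays the role of svr_conn[i].cn_active */

void close_conn_like(          /* same early-return guard shape as close_conn() above */

  int sd) /* I */

  {
  if ((sd < 0) || (MAX_CONN <= sd))
    {
    return;
    }

  if (cn_active[sd] == Idle)   /* returns before anything is closed */
    {
    return;
    }

  printf("closing descriptor %d\n", sd);

  cn_active[sd] = Idle;
  }

int main(void)

  {
  cn_active[2] = Idle;         /* the state the else branch is guaranteed to be in */
  close_conn_like(2);          /* prints nothing: the call is a no-op */

  cn_active[3] = Busy;
  close_conn_like(3);          /* prints "closing descriptor 3" */

  return 0;
  }

So as written, the else branch never reaches the close(sd) inside close_conn,
and the descriptor is simply left open.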

On Tue, Jun 16, 2009 at 2:51 PM, Joshua Bernstein
<jbernstein at penguincomputing.com> wrote:
> Hey Glen,
>
>        You may recall I posted something about pbs_mom crashing back in
> December. Based on what the core dump looks like, I did implement a fix;
> even if it isn't the correct one, it did make the problem go away:
>
> http://www.clusterresources.com/pipermail/torquedev/2008-December/001276.html
>
> -Joshua Bernstein
> Senior Software Engineer
> Penguin Computing
>
> Glen Beane wrote:
>>
>> In the past few months we have upgraded from TORQUE 2.1.X to 2.3.6 (in
>> order to use the latest Moab 5.3.x, which we needed for a specific
>> feature).  Ever since the upgrade we've had pbs_moms just die.  It
>> doesn't seem to happen very regularly, but something is definitely going
>> on with our system.  Has anyone else seen this?  I just had one croak
>> today.  The uptime on the node is about 320 days; pbs_mom just seemed
>> to die randomly.
>>
>> Here is what the log file looks like up until pbs_mom went away:
>>
>>
>>
>> 06/16/2009 07:40:42;0002;   pbs_mom;n/a;toolong;alarm call
>> 06/16/2009 07:42:31;0002;   pbs_mom;n/a;toolong;alarm call
>> 06/16/2009 07:49:26;0002;   pbs_mom;n/a;toolong;alarm call
>> 06/16/2009 07:50:17;0002;   pbs_mom;n/a;toolong;alarm call
>> 06/16/2009 07:51:02;0002;   pbs_mom;n/a;toolong;alarm call
>> 06/16/2009 07:51:56;0002;   pbs_mom;n/a;toolong;alarm call
>> 06/16/2009 07:55:18;0002;   pbs_mom;n/a;toolong;alarm call
>> 06/16/2009 07:56:43;0002;   pbs_mom;n/a;toolong;alarm call
>> 06/16/2009 07:58:25;0002;
>> pbs_mom;n/a;mom_server_check_connection;connection to server wulfgar
>> timeout
>> 06/16/2009 07:58:25;0002;
>> pbs_mom;n/a;mom_server_check_connection;sending hello to server
>> wulfgar
>> 06/16/2009 07:58:46;0002;   pbs_mom;n/a;toolong;alarm call
>> 06/16/2009 08:00:23;0002;   pbs_mom;n/a;toolong;alarm call
>> 06/16/2009 08:01:41;0002;   pbs_mom;n/a;toolong;alarm call
>> 06/16/2009 08:02:10;0002;   pbs_mom;n/a;toolong;alarm call
>> 06/16/2009 08:15:26;0080;
>> pbs_mom;Job;42690.wulfgar.jax.org;scan_for_terminated: job
>> 42690.wulfgar.jax.org task 1 terminated, sid=18362
>> 06/16/2009 08:15:26;0008;   pbs_mom;Job;42690.wulfgar.jax.org;job was
>> terminated
>> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
>> 06/16/2009 08:15:26;0080;
>> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked,
>> top of while loop
>> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;in while loop,
>> no error from job stat
>> 06/16/2009 08:15:26;0008;   pbs_mom;Job;42690.wulfgar.jax.org;checking
>> job post-processing routine
>> 06/16/2009 08:15:26;0080;   pbs_mom;Job;42690.wulfgar.jax.org;obit
>> sent to server
>> 06/16/2009 08:15:26;0080;
>> pbs_mom;Job;42691.wulfgar.jax.org;scan_for_terminated: job
>> 42691.wulfgar.jax.org task 1 terminated, sid=18372
>> 06/16/2009 08:15:26;0008;   pbs_mom;Job;42691.wulfgar.jax.org;job was
>> terminated
>> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
>> 06/16/2009 08:15:26;0080;
>> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked,
>> top of while loop
>> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;in while loop,
>> no error from job stat
>> 06/16/2009 08:15:26;0008;   pbs_mom;Job;42691.wulfgar.jax.org;checking
>> job post-processing routine
>> 06/16/2009 08:15:26;0080;   pbs_mom;Job;42691.wulfgar.jax.org;obit
>> sent to server
>> 06/16/2009 08:15:26;0080;
>> pbs_mom;Job;42689.wulfgar.jax.org;scan_for_terminated: job
>> 42689.wulfgar.jax.org task 1 terminated, sid=18352
>> 06/16/2009 08:15:26;0008;   pbs_mom;Job;42689.wulfgar.jax.org;job was
>> terminated
>> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
>> 06/16/2009 08:15:26;0080;
>> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked,
>> top of while loop
>> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;in while loop,
>> no error from job stat
>> 06/16/2009 08:15:26;0008;   pbs_mom;Job;42689.wulfgar.jax.org;checking
>> job post-processing routine
>> 06/16/2009 08:15:26;0080;   pbs_mom;Job;42689.wulfgar.jax.org;obit
>> sent to server
>> 06/16/2009 08:15:33;0080;
>> pbs_mom;Job;42688.wulfgar.jax.org;scan_for_terminated: job
>> 42688.wulfgar.jax.org task 1 terminated, sid=18343
>> 06/16/2009 08:15:33;0008;   pbs_mom;Job;42688.wulfgar.jax.org;job was
>> terminated
>> 06/16/2009 08:15:33;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
>> 06/16/2009 08:15:33;0080;
>> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked,
>> top of while loop
>> 06/16/2009 08:15:33;0080;   pbs_mom;Svr;preobit_reply;in while loop,
>> no error from job stat
>> 06/16/2009 08:15:33;0008;   pbs_mom;Job;42688.wulfgar.jax.org;checking
>> job post-processing routine
>> 06/16/2009 08:15:33;0080;   pbs_mom;Job;42688.wulfgar.jax.org;obit
>> sent to server
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>

