[torquedev] pbs_mom crashing

Ken Nielson knielson at clusterresources.com
Tue Jun 16 09:49:40 MDT 2009


Glen,

Did your mom terminate with a SIGPIPE?

If so I have a fix for that problem I would like you to try.

Let me know.

Thanks

Ken Nielson
Cluster Resources, Inc.
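
For readers following the SIGPIPE question above: a parent process can tell from the exit status whether a child was killed by SIGPIPE. The sketch below is only an illustration in Python, not the fix mentioned in this message (the message does not describe it), and the helper names are made up for the example. It also shows that the same write fails with EPIPE when SIGPIPE is ignored, which is one common way daemons avoid dying on a broken connection.

```python
import os
import signal
import subprocess
import sys

# Child script: CPython ignores SIGPIPE at startup, so restore the default
# disposition, then write to a pipe whose read end is already closed.
CHILD = """
import os, signal
signal.signal(signal.SIGPIPE, signal.SIG_DFL)
r, w = os.pipe()
os.close(r)
os.write(w, b"x")
"""

def run_child():
    """Run the child script and return its exit status."""
    return subprocess.run([sys.executable, "-c", CHILD]).returncode

def killed_by_sigpipe(returncode):
    """subprocess reports death by signal N as a return code of -N."""
    return returncode == -signal.SIGPIPE

def write_fails_with_epipe():
    """With SIGPIPE ignored (CPython's default), the write raises EPIPE."""
    r, w = os.pipe()
    os.close(r)
    try:
        os.write(w, b"x")
    except BrokenPipeError:
        return True
    finally:
        os.close(w)
    return False

if __name__ == "__main__":
    print("child killed by SIGPIPE:", killed_by_sigpipe(run_child()))
    print("write failed with EPIPE:", write_fails_with_epipe())
```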

----- Original Message -----
From: "Glen Beane" <glen.beane at gmail.com>
To: "Torque Dev" <torquedev at supercluster.org>
Sent: Tuesday, June 16, 2009 7:56:24 AM GMT -07:00 US/Canada Mountain
Subject: Re: [torquedev] pbs_mom crashing

We're going to enable core dumps before starting pbs_mom on some nodes,
and hopefully we'll catch something.
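
A minimal sketch of what "enable core dumps before starting" can look like, assuming a Linux node and a Python wrapper: raise the soft RLIMIT_CORE to the hard limit in the child just before it executes, so the daemon and anything it forks inherit the limit. The function names are hypothetical, and launching pbs_mom by bare name assumes it is on PATH; a shell `ulimit -c unlimited` in the init script achieves the same thing.

```python
import resource
import subprocess
import sys

def raise_core_limit():
    """Raise the soft core-file size limit to the hard limit."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)

def start_with_core_dumps(argv):
    """Start argv with the raised limit applied in the child only."""
    return subprocess.Popen(argv, preexec_fn=raise_core_limit)

def child_core_limit():
    """Report the (soft, hard) core limit that a child actually sees."""
    code = "import resource; print(*resource.getrlimit(resource.RLIMIT_CORE))"
    out = subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=raise_core_limit,
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    soft, hard = (int(x) for x in out.split())
    return soft, hard

if __name__ == "__main__":
    print("core limit seen by child:", child_core_limit())
    # Assumption: pbs_mom is on PATH; uncomment on a compute node.
    # start_with_core_dumps(["pbs_mom"])
```

If the hard limit itself is 0, the soft limit cannot be raised without root, and the kernel's core pattern setting still decides where the file is written.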

On Tue, Jun 16, 2009 at 9:20 AM, Glen Beane <glen.beane at gmail.com> wrote:
> In the past few months we have upgraded from TORQUE 2.1.X to 2.3.6 (in
> order to use the latest Moab 5.3.x, which we needed for a specific
> feature).  Ever since the upgrade we've had pbs_moms just die.  It
> doesn't seem to happen at regular intervals, but something is definitely
> going on with our system.  Has anyone else seen this?  I just had one
> croak today.  The uptime on the node is about 320 days; pbs_mom just
> seemed to die randomly.
>
> Here is what the log file looks like up until pbs_mom went away:
>
>
>
> 06/16/2009 07:40:42;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 07:42:31;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 07:49:26;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 07:50:17;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 07:51:02;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 07:51:56;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 07:55:18;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 07:56:43;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 07:58:25;0002;   pbs_mom;n/a;mom_server_check_connection;connection to server wulfgar timeout
> 06/16/2009 07:58:25;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server wulfgar
> 06/16/2009 07:58:46;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 08:00:23;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 08:01:41;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 08:02:10;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 08:15:26;0080;   pbs_mom;Job;42690.wulfgar.jax.org;scan_for_terminated: job 42690.wulfgar.jax.org task 1 terminated, sid=18362
> 06/16/2009 08:15:26;0008;   pbs_mom;Job;42690.wulfgar.jax.org;job was terminated
> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 06/16/2009 08:15:26;0008;   pbs_mom;Job;42690.wulfgar.jax.org;checking job post-processing routine
> 06/16/2009 08:15:26;0080;   pbs_mom;Job;42690.wulfgar.jax.org;obit sent to server
> 06/16/2009 08:15:26;0080;   pbs_mom;Job;42691.wulfgar.jax.org;scan_for_terminated: job 42691.wulfgar.jax.org task 1 terminated, sid=18372
> 06/16/2009 08:15:26;0008;   pbs_mom;Job;42691.wulfgar.jax.org;job was terminated
> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 06/16/2009 08:15:26;0008;   pbs_mom;Job;42691.wulfgar.jax.org;checking job post-processing routine
> 06/16/2009 08:15:26;0080;   pbs_mom;Job;42691.wulfgar.jax.org;obit sent to server
> 06/16/2009 08:15:26;0080;   pbs_mom;Job;42689.wulfgar.jax.org;scan_for_terminated: job 42689.wulfgar.jax.org task 1 terminated, sid=18352
> 06/16/2009 08:15:26;0008;   pbs_mom;Job;42689.wulfgar.jax.org;job was terminated
> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 06/16/2009 08:15:26;0008;   pbs_mom;Job;42689.wulfgar.jax.org;checking job post-processing routine
> 06/16/2009 08:15:26;0080;   pbs_mom;Job;42689.wulfgar.jax.org;obit sent to server
> 06/16/2009 08:15:33;0080;   pbs_mom;Job;42688.wulfgar.jax.org;scan_for_terminated: job 42688.wulfgar.jax.org task 1 terminated, sid=18343
> 06/16/2009 08:15:33;0008;   pbs_mom;Job;42688.wulfgar.jax.org;job was terminated
> 06/16/2009 08:15:33;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
> 06/16/2009 08:15:33;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 06/16/2009 08:15:33;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 06/16/2009 08:15:33;0008;   pbs_mom;Job;42688.wulfgar.jax.org;checking job post-processing routine
> 06/16/2009 08:15:33;0080;   pbs_mom;Job;42688.wulfgar.jax.org;obit sent to server
>
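
The log excerpt above uses semicolon-separated records, which makes it easy to summarize, for example to count the "alarm call" entries that precede the crash. The following is a rough sketch based only on the lines quoted in this thread; the field names (`code`, `daemon`, `obj_class`, `name`) are assumptions made for the example, not names taken from the TORQUE documentation.

```python
from collections import Counter
from datetime import datetime

def parse_mom_line(line):
    """Split one log record into fields; return None if it does not match."""
    # Drop an email quote prefix ("> ") if the line was copied from a reply.
    line = line.strip()
    if line.startswith(">"):
        line = line[1:].strip()
    parts = line.split(";", 5)
    if len(parts) != 6:
        return None
    stamp, code, daemon, obj_class, name, message = parts
    try:
        when = datetime.strptime(stamp.strip(), "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return {
        "when": when,
        "code": code.strip(),
        "daemon": daemon.strip(),
        "obj_class": obj_class.strip(),
        "name": name.strip(),
        "message": message.strip(),
    }

def count_messages(lines):
    """Count records by (name, message), skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        rec = parse_mom_line(line)
        if rec is not None:
            counts[(rec["name"], rec["message"])] += 1
    return counts

if __name__ == "__main__":
    sample = [
        "06/16/2009 07:40:42;0002;   pbs_mom;n/a;toolong;alarm call",
        "06/16/2009 07:42:31;0002;   pbs_mom;n/a;toolong;alarm call",
        "06/16/2009 08:15:26;0008;   pbs_mom;Job;42690.wulfgar.jax.org;job was terminated",
    ]
    for key, n in count_messages(sample).most_common():
        print(n, key)
```

Records that were wrapped across several lines by a mail client need to be rejoined before parsing, since the sketch treats each input line as one record.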