[torqueusers] pbs_mom crashing

Glen Beane glen.beane at gmail.com
Wed Jun 17 07:07:48 MDT 2009


I had two nodes do this yesterday. In both cases the last thing in
the log file is an "obit sent to server" message. These nodes had
been running single-processor jobs that took about 15 hours
each. Each node had probably run at most a dozen of these jobs
(4 at a time) out of the 1,000 or so that were submitted to the cluster.
I'm not sure how long it had been since pbs_mom was last restarted on
these two nodes.
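
For anyone who wants to check their own nodes for the same pattern, here is
a minimal sketch in Python. It assumes the mom log uses the six
semicolon-separated columns seen in the log excerpt quoted below; the field
names and both function names are my own labels, not anything defined by
TORQUE, and the log file path is left for you to supply.

```python
import sys
from collections import namedtuple

# Field names are illustrative labels for the six ';'-separated columns
# seen in the quoted log; they are not official TORQUE terminology.
LogEntry = namedtuple(
    "LogEntry", "timestamp code daemon obj_class obj_name message"
)


def parse_mom_log(lines):
    """Parse pbs_mom log lines, skipping anything without six fields."""
    entries = []
    for raw in lines:
        # Split on the first five ';' only, so the message may contain ';'.
        parts = raw.rstrip("\n").split(";", 5)
        if len(parts) == 6:
            entries.append(LogEntry(*(p.strip() for p in parts)))
    return entries


def died_after_obit(entries):
    """True if the last parsed entry is an 'obit sent to server' message."""
    return bool(entries) and entries[-1].message.startswith(
        "obit sent to server"
    )


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_mom_log.py <path-to-mom-log-file>
    with open(sys.argv[1]) as fh:
        parsed = parse_mom_log(fh)
    print("last entry is an obit:", died_after_obit(parsed))
```

This only tells you whether the log stops right after an obit; it says
nothing about why pbs_mom exited.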





On Tue, Jun 16, 2009 at 9:18 AM, Glen Beane<glen.beane at gmail.com> wrote:
> In the past few months we have upgraded from TORQUE 2.1.X to 2.3.6 (in
> order to use the latest Moab 5.3.x, which we needed for a specific
> feature). Ever since the upgrade we've had pbs_moms just die. It
> doesn't happen on any regular schedule, but something is definitely going on
> with our system. Has anyone else seen this? I just had one croak
> today. The uptime on the node is about 320 days; pbs_mom just seemed
> to die randomly.
>
> Here is what the log file looks like up until pbs_mom went away:
>
>
>
> 06/16/2009 07:40:42;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 07:42:31;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 07:49:26;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 07:50:17;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 07:51:02;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 07:51:56;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 07:55:18;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 07:56:43;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 07:58:25;0002;   pbs_mom;n/a;mom_server_check_connection;connection to server wulfgar timeout
> 06/16/2009 07:58:25;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server wulfgar
> 06/16/2009 07:58:46;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 08:00:23;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 08:01:41;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 08:02:10;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 08:15:26;0080;   pbs_mom;Job;42690.wulfgar.jax.org;scan_for_terminated: job 42690.wulfgar.jax.org task 1 terminated, sid=18362
> 06/16/2009 08:15:26;0008;   pbs_mom;Job;42690.wulfgar.jax.org;job was terminated
> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 06/16/2009 08:15:26;0008;   pbs_mom;Job;42690.wulfgar.jax.org;checking job post-processing routine
> 06/16/2009 08:15:26;0080;   pbs_mom;Job;42690.wulfgar.jax.org;obit sent to server
> 06/16/2009 08:15:26;0080;   pbs_mom;Job;42691.wulfgar.jax.org;scan_for_terminated: job 42691.wulfgar.jax.org task 1 terminated, sid=18372
> 06/16/2009 08:15:26;0008;   pbs_mom;Job;42691.wulfgar.jax.org;job was terminated
> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 06/16/2009 08:15:26;0008;   pbs_mom;Job;42691.wulfgar.jax.org;checking job post-processing routine
> 06/16/2009 08:15:26;0080;   pbs_mom;Job;42691.wulfgar.jax.org;obit sent to server
> 06/16/2009 08:15:26;0080;   pbs_mom;Job;42689.wulfgar.jax.org;scan_for_terminated: job 42689.wulfgar.jax.org task 1 terminated, sid=18352
> 06/16/2009 08:15:26;0008;   pbs_mom;Job;42689.wulfgar.jax.org;job was terminated
> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 06/16/2009 08:15:26;0008;   pbs_mom;Job;42689.wulfgar.jax.org;checking job post-processing routine
> 06/16/2009 08:15:26;0080;   pbs_mom;Job;42689.wulfgar.jax.org;obit sent to server
> 06/16/2009 08:15:33;0080;   pbs_mom;Job;42688.wulfgar.jax.org;scan_for_terminated: job 42688.wulfgar.jax.org task 1 terminated, sid=18343
> 06/16/2009 08:15:33;0008;   pbs_mom;Job;42688.wulfgar.jax.org;job was terminated
> 06/16/2009 08:15:33;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
> 06/16/2009 08:15:33;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 06/16/2009 08:15:33;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 06/16/2009 08:15:33;0008;   pbs_mom;Job;42688.wulfgar.jax.org;checking job post-processing routine
> 06/16/2009 08:15:33;0080;   pbs_mom;Job;42688.wulfgar.jax.org;obit sent to server
>

