[torquedev] pbs_mom crashing

Glen Beane glen.beane at gmail.com
Tue Jun 16 07:20:51 MDT 2009


In the past few months we have upgraded from TORQUE 2.1.x to 2.3.6 (in
order to use the latest Moab 5.3.x, which we needed for a specific
feature).  Ever since the upgrade we've had pbs_moms just die.  It
doesn't happen on any regular schedule, but something is definitely
going on with our system.  Has anyone else seen this?  I just had one
die today.  The node itself has been up for about 320 days; pbs_mom
just seemed to die at random.

Here is what the log file looks like up until pbs_mom went away:



06/16/2009 07:40:42;0002;   pbs_mom;n/a;toolong;alarm call
06/16/2009 07:42:31;0002;   pbs_mom;n/a;toolong;alarm call
06/16/2009 07:49:26;0002;   pbs_mom;n/a;toolong;alarm call
06/16/2009 07:50:17;0002;   pbs_mom;n/a;toolong;alarm call
06/16/2009 07:51:02;0002;   pbs_mom;n/a;toolong;alarm call
06/16/2009 07:51:56;0002;   pbs_mom;n/a;toolong;alarm call
06/16/2009 07:55:18;0002;   pbs_mom;n/a;toolong;alarm call
06/16/2009 07:56:43;0002;   pbs_mom;n/a;toolong;alarm call
06/16/2009 07:58:25;0002;   pbs_mom;n/a;mom_server_check_connection;connection to server wulfgar timeout
06/16/2009 07:58:25;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server wulfgar
06/16/2009 07:58:46;0002;   pbs_mom;n/a;toolong;alarm call
06/16/2009 08:00:23;0002;   pbs_mom;n/a;toolong;alarm call
06/16/2009 08:01:41;0002;   pbs_mom;n/a;toolong;alarm call
06/16/2009 08:02:10;0002;   pbs_mom;n/a;toolong;alarm call
06/16/2009 08:15:26;0080;   pbs_mom;Job;42690.wulfgar.jax.org;scan_for_terminated: job 42690.wulfgar.jax.org task 1 terminated, sid=18362
06/16/2009 08:15:26;0008;   pbs_mom;Job;42690.wulfgar.jax.org;job was terminated
06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
06/16/2009 08:15:26;0008;   pbs_mom;Job;42690.wulfgar.jax.org;checking job post-processing routine
06/16/2009 08:15:26;0080;   pbs_mom;Job;42690.wulfgar.jax.org;obit sent to server
06/16/2009 08:15:26;0080;   pbs_mom;Job;42691.wulfgar.jax.org;scan_for_terminated: job 42691.wulfgar.jax.org task 1 terminated, sid=18372
06/16/2009 08:15:26;0008;   pbs_mom;Job;42691.wulfgar.jax.org;job was terminated
06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
06/16/2009 08:15:26;0008;   pbs_mom;Job;42691.wulfgar.jax.org;checking job post-processing routine
06/16/2009 08:15:26;0080;   pbs_mom;Job;42691.wulfgar.jax.org;obit sent to server
06/16/2009 08:15:26;0080;   pbs_mom;Job;42689.wulfgar.jax.org;scan_for_terminated: job 42689.wulfgar.jax.org task 1 terminated, sid=18352
06/16/2009 08:15:26;0008;   pbs_mom;Job;42689.wulfgar.jax.org;job was terminated
06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
06/16/2009 08:15:26;0008;   pbs_mom;Job;42689.wulfgar.jax.org;checking job post-processing routine
06/16/2009 08:15:26;0080;   pbs_mom;Job;42689.wulfgar.jax.org;obit sent to server
06/16/2009 08:15:33;0080;   pbs_mom;Job;42688.wulfgar.jax.org;scan_for_terminated: job 42688.wulfgar.jax.org task 1 terminated, sid=18343
06/16/2009 08:15:33;0008;   pbs_mom;Job;42688.wulfgar.jax.org;job was terminated
06/16/2009 08:15:33;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
06/16/2009 08:15:33;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
06/16/2009 08:15:33;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
06/16/2009 08:15:33;0008;   pbs_mom;Job;42688.wulfgar.jax.org;checking job post-processing routine
06/16/2009 08:15:33;0080;   pbs_mom;Job;42688.wulfgar.jax.org;obit sent to server
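
In case it helps anyone comparing notes from their own nodes, here is a
rough sketch (not part of TORQUE) for tallying what a pbs_mom log was
doing before the daemon went away.  It assumes each record sits on one
line with six semicolon-separated fields, as in the excerpt above; the
field names (time, code, daemon, class, name, message) are my own
labels, not official ones, and parse_record/summarize are hypothetical
helpers.

```python
from collections import Counter
from datetime import datetime


def parse_record(line):
    """Split one log record into six fields; return None if it does not fit."""
    parts = line.strip().split(";", 5)
    if len(parts) != 6:
        return None
    stamp, code, daemon, obj_class, obj_name, message = (p.strip() for p in parts)
    try:
        when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None  # wrapped or damaged line, skip it
    # Field names below are illustrative labels, not TORQUE terminology.
    return {
        "time": when,
        "code": code,
        "daemon": daemon,
        "class": obj_class,
        "name": obj_name,
        "message": message,
    }


def summarize(lines):
    """Count (name, message) pairs and remember the last parsable record."""
    counts = Counter()
    last = None
    for line in lines:
        rec = parse_record(line)
        if rec is None:
            continue
        counts[(rec["name"], rec["message"])] += 1
        last = rec
    return counts, last


SAMPLE = [
    "06/16/2009 07:40:42;0002;   pbs_mom;n/a;toolong;alarm call",
    "06/16/2009 07:42:31;0002;   pbs_mom;n/a;toolong;alarm call",
    "06/16/2009 08:15:26;0008;   pbs_mom;Job;42690.wulfgar.jax.org;job was terminated",
]

if __name__ == "__main__":
    counts, last = summarize(SAMPLE)
    for (name, message), n in counts.most_common():
        print(n, name, message)
    print("last record at", last["time"])
```

Pointing summarize() at an open log file instead of SAMPLE gives the
same tally for a whole day; the last record's timestamp is a quick way
to see when the mom stopped writing.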

