[torqueusers] pbs_mom crashing
Glen Beane
glen.beane at gmail.com
Tue Jun 16 07:18:53 MDT 2009
In the past few months we have upgraded from TORQUE 2.1.X to 2.3.6 (in
order to use the latest Moab 5.3.x, which we needed for a specific
feature). Ever since the upgrade we've had pbs_moms just die. It
doesn't happen on any regular schedule, but something is definitely
going on with our system. Has anyone else seen this? I just had one
die today. The node has been up for about 320 days; pbs_mom just
seemed to die at random.
Here is what the log file looks like up until pbs_mom went away:
06/16/2009 07:40:42;0002; pbs_mom;n/a;toolong;alarm call
06/16/2009 07:42:31;0002; pbs_mom;n/a;toolong;alarm call
06/16/2009 07:49:26;0002; pbs_mom;n/a;toolong;alarm call
06/16/2009 07:50:17;0002; pbs_mom;n/a;toolong;alarm call
06/16/2009 07:51:02;0002; pbs_mom;n/a;toolong;alarm call
06/16/2009 07:51:56;0002; pbs_mom;n/a;toolong;alarm call
06/16/2009 07:55:18;0002; pbs_mom;n/a;toolong;alarm call
06/16/2009 07:56:43;0002; pbs_mom;n/a;toolong;alarm call
06/16/2009 07:58:25;0002; pbs_mom;n/a;mom_server_check_connection;connection to server wulfgar timeout
06/16/2009 07:58:25;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server wulfgar
06/16/2009 07:58:46;0002; pbs_mom;n/a;toolong;alarm call
06/16/2009 08:00:23;0002; pbs_mom;n/a;toolong;alarm call
06/16/2009 08:01:41;0002; pbs_mom;n/a;toolong;alarm call
06/16/2009 08:02:10;0002; pbs_mom;n/a;toolong;alarm call
06/16/2009 08:15:26;0080; pbs_mom;Job;42690.wulfgar.jax.org;scan_for_terminated: job 42690.wulfgar.jax.org task 1 terminated, sid=18362
06/16/2009 08:15:26;0008; pbs_mom;Job;42690.wulfgar.jax.org;job was terminated
06/16/2009 08:15:26;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
06/16/2009 08:15:26;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
06/16/2009 08:15:26;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
06/16/2009 08:15:26;0008; pbs_mom;Job;42690.wulfgar.jax.org;checking job post-processing routine
06/16/2009 08:15:26;0080; pbs_mom;Job;42690.wulfgar.jax.org;obit sent to server
06/16/2009 08:15:26;0080; pbs_mom;Job;42691.wulfgar.jax.org;scan_for_terminated: job 42691.wulfgar.jax.org task 1 terminated, sid=18372
06/16/2009 08:15:26;0008; pbs_mom;Job;42691.wulfgar.jax.org;job was terminated
06/16/2009 08:15:26;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
06/16/2009 08:15:26;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
06/16/2009 08:15:26;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
06/16/2009 08:15:26;0008; pbs_mom;Job;42691.wulfgar.jax.org;checking job post-processing routine
06/16/2009 08:15:26;0080; pbs_mom;Job;42691.wulfgar.jax.org;obit sent to server
06/16/2009 08:15:26;0080; pbs_mom;Job;42689.wulfgar.jax.org;scan_for_terminated: job 42689.wulfgar.jax.org task 1 terminated, sid=18352
06/16/2009 08:15:26;0008; pbs_mom;Job;42689.wulfgar.jax.org;job was terminated
06/16/2009 08:15:26;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
06/16/2009 08:15:26;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
06/16/2009 08:15:26;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
06/16/2009 08:15:26;0008; pbs_mom;Job;42689.wulfgar.jax.org;checking job post-processing routine
06/16/2009 08:15:26;0080; pbs_mom;Job;42689.wulfgar.jax.org;obit sent to server
06/16/2009 08:15:33;0080; pbs_mom;Job;42688.wulfgar.jax.org;scan_for_terminated: job 42688.wulfgar.jax.org task 1 terminated, sid=18343
06/16/2009 08:15:33;0008; pbs_mom;Job;42688.wulfgar.jax.org;job was terminated
06/16/2009 08:15:33;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
06/16/2009 08:15:33;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
06/16/2009 08:15:33;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
06/16/2009 08:15:33;0008; pbs_mom;Job;42688.wulfgar.jax.org;checking job post-processing routine
06/16/2009 08:15:33;0080; pbs_mom;Job;42688.wulfgar.jax.org;obit sent to server
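
For anyone comparing logs from several nodes, here is a rough Python
sketch that tallies entries like the ones above and reports the last
timestamp written before the daemon went away. It assumes each complete
log line has the six semicolon-separated fields visible in this excerpt
(timestamp, event code, daemon, class, name, message); the function
names parse_line and summarize are made up for illustration and are not
part of TORQUE.

```python
from collections import Counter
from datetime import datetime


def parse_line(line):
    """Parse one complete log line; return None for anything else.

    Assumed layout (taken from the excerpt, not from TORQUE docs):
    timestamp;code;daemon;class;name;message
    """
    parts = line.strip().split(";", 5)
    if len(parts) != 6:
        return None  # wrapped continuation line or malformed entry
    stamp, code, daemon, klass, name, message = (p.strip() for p in parts)
    try:
        when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return {
        "time": when,
        "code": code,
        "daemon": daemon,
        "class": klass,
        "name": name,
        "message": message,
    }


def summarize(lines):
    """Count entries per name field and remember the last parsed entry."""
    counts = Counter()
    last = None
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        counts[entry["name"]] += 1
        last = entry
    return counts, last
```

Running summarize() over the lines of a mom log file gives, for
example, counts["toolong"] as the number of "alarm call" entries, and
last["time"] as the time of the final entry before pbs_mom stopped
logging.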