[torqueusers] pbs_mom crashing

Joshua Bernstein jbernstein at penguincomputing.com
Tue Jun 16 12:51:00 MDT 2009


Hey Glen,

	You may recall that I posted about pbs_mom crashing back in December.
Based on what the core dump showed, I implemented a fix. It may not be the
correct one, but it did make the problem go away:

http://www.clusterresources.com/pipermail/torquedev/2008-December/001276.html

-Joshua Bernstein
Senior Software Engineer
Penguin Computing

Glen Beane wrote:
> in the past few months we have upgraded from TORQUE 2.1.X to 2.3.6 (in
> order to use the latest Moab 5.3.x, which we needed for a specific
> feature).  Ever since the upgrade we've had pbs_moms just die.  It
> doesn't seem to be real regular, but something is definitely going on
> with our system.  Has anyone else seen this?  I just had one croak
> today.  The uptime on the node is about 320 days, pbs_mom just seemed
> to die randomly.
> 
> Here is what the log file looks like up until pbs_mom went away:
> 
> 
> 
> 06/16/2009 07:40:42;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 07:42:31;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 07:49:26;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 07:50:17;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 07:51:02;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 07:51:56;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 07:55:18;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 07:56:43;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 07:58:25;0002;   pbs_mom;n/a;mom_server_check_connection;connection to server wulfgar timeout
> 06/16/2009 07:58:25;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server wulfgar
> 06/16/2009 07:58:46;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 08:00:23;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 08:01:41;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 08:02:10;0002;   pbs_mom;n/a;toolong;alarm call
> 06/16/2009 08:15:26;0080;   pbs_mom;Job;42690.wulfgar.jax.org;scan_for_terminated: job 42690.wulfgar.jax.org task 1 terminated, sid=18362
> 06/16/2009 08:15:26;0008;   pbs_mom;Job;42690.wulfgar.jax.org;job was terminated
> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 06/16/2009 08:15:26;0008;   pbs_mom;Job;42690.wulfgar.jax.org;checking job post-processing routine
> 06/16/2009 08:15:26;0080;   pbs_mom;Job;42690.wulfgar.jax.org;obit sent to server
> 06/16/2009 08:15:26;0080;   pbs_mom;Job;42691.wulfgar.jax.org;scan_for_terminated: job 42691.wulfgar.jax.org task 1 terminated, sid=18372
> 06/16/2009 08:15:26;0008;   pbs_mom;Job;42691.wulfgar.jax.org;job was terminated
> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 06/16/2009 08:15:26;0008;   pbs_mom;Job;42691.wulfgar.jax.org;checking job post-processing routine
> 06/16/2009 08:15:26;0080;   pbs_mom;Job;42691.wulfgar.jax.org;obit sent to server
> 06/16/2009 08:15:26;0080;   pbs_mom;Job;42689.wulfgar.jax.org;scan_for_terminated: job 42689.wulfgar.jax.org task 1 terminated, sid=18352
> 06/16/2009 08:15:26;0008;   pbs_mom;Job;42689.wulfgar.jax.org;job was terminated
> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 06/16/2009 08:15:26;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 06/16/2009 08:15:26;0008;   pbs_mom;Job;42689.wulfgar.jax.org;checking job post-processing routine
> 06/16/2009 08:15:26;0080;   pbs_mom;Job;42689.wulfgar.jax.org;obit sent to server
> 06/16/2009 08:15:33;0080;   pbs_mom;Job;42688.wulfgar.jax.org;scan_for_terminated: job 42688.wulfgar.jax.org task 1 terminated, sid=18343
> 06/16/2009 08:15:33;0008;   pbs_mom;Job;42688.wulfgar.jax.org;job was terminated
> 06/16/2009 08:15:33;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
> 06/16/2009 08:15:33;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 06/16/2009 08:15:33;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 06/16/2009 08:15:33;0008;   pbs_mom;Job;42688.wulfgar.jax.org;checking job post-processing routine
> 06/16/2009 08:15:33;0080;   pbs_mom;Job;42688.wulfgar.jax.org;obit sent to server
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
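
For anyone triaging a similar log, here is a minimal Python sketch that tallies
records like the ones quoted above and reports the timestamp of the last entry
before pbs_mom went away. It assumes each record is a single line with six
semicolon-separated fields, as in the excerpt. The field labels (code, daemon,
kind, name, message) and the helper names parse_line and summarize are my own
for illustration, not TORQUE terminology, and the sample lines are copied from
the quoted log.

```python
from collections import Counter
from datetime import datetime

# Three records copied from the quoted log, one record per line.
SAMPLE = """\
06/16/2009 07:40:42;0002;   pbs_mom;n/a;toolong;alarm call
06/16/2009 07:58:25;0002;   pbs_mom;n/a;mom_server_check_connection;connection to server wulfgar timeout
06/16/2009 08:15:26;0008;   pbs_mom;Job;42690.wulfgar.jax.org;job was terminated
"""


def parse_line(line):
    """Split one record into six fields; return None if it does not fit."""
    parts = line.strip().split(";", 5)
    if len(parts) != 6:
        return None  # wrapped or malformed line, skip it
    stamp, code, daemon, kind, name, message = (p.strip() for p in parts)
    try:
        when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None  # first field is not a timestamp in the assumed format
    # The dictionary keys are illustrative labels (an assumption).
    return {"time": when, "code": code, "daemon": daemon,
            "kind": kind, "name": name, "message": message}


def summarize(text):
    """Count records by their fifth field and find the last timestamp."""
    records = [r for r in map(parse_line, text.splitlines()) if r]
    counts = Counter(r["name"] for r in records)
    last = records[-1]["time"] if records else None
    return counts, last


if __name__ == "__main__":
    counts, last = summarize(SAMPLE)
    for name, n in counts.most_common():
        print(n, name)
    print("last record:", last)
```

Run against a full mom log, the count for "toolong" gives a rough idea of how
often the alarm fired before the daemon died. Records that were wrapped by a
mail client will be skipped rather than parsed.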

