[torquedev] pbs_moms dying from SIGALRM
Garrick Staples
garrick at usc.edu
Thu Jan 12 11:21:11 MST 2006
On Wed, Jan 11, 2006 at 05:12:40PM -0700, Marcus R. Epperson alleged:
> We have been seeing pbs_moms die somewhat regularly on our machines and
> have tracked down the apparent cause. It seems that under heavy load, some
> of the alarm calls in is_update_stat() were expiring, and since run_pelog()
> had reset the SIGALRM handler to SIG_DFL, this was fatal to the mom
> (and subsequently to the job).
>
> The attached patch fixes this by making run_pelog() restore the previous
> handler rather than forcing it to be SIG_DFL. With this patch applied, the
> mom runs the original SIGALRM handler (toolong()) rather than dying. This
> handler doesn't have any real effect, other than informing you that some
> action took longer than expected, but that seems to be more correct than
> the alternative.
>
> To simulate the failure:
> - Watch the mom_logs while you run the following on a node that's not part
> of a job:
> # killall -ALRM pbs_mom
> You should see a log message similar to this:
> 01/11/2006 15:53:51;0002; pbs_mom;n/a;toolong;alarm call
> - Then launch a job on the node, wait for the JOIN JOB message, and then
> run the killall command again. On our systems this was killing the mom
> process immediately.
>
> --
> --------------------------------------------------
> Marcus R. Epperson -- SNL
> --------------------------------------------------
> --
Checked in your patch. Thanks!
We should also make sure that every function that sets an alarm always
sets the appropriate handler. I think the primary offender here is
is_update_stat().
> --- src/resmom/prolog.c-orig 2006-01-11 15:44:10.000000000 -0700
> +++ src/resmom/prolog.c 2006-01-11 15:45:36.000000000 -0700
> @@ -306,7 +306,7 @@ int run_pelog(
> {
> char *id = "run_pelog";
>
> - struct sigaction act;
> + struct sigaction act, oldact;
> char *arg[11];
> int fds1 = 0;
> int fds2 = 0;
> @@ -388,7 +388,7 @@ int run_pelog(
>
> act.sa_flags = 0;
>
> - sigaction(SIGALRM,&act,0);
> + sigaction(SIGALRM,&act,&oldact);
>
> /* it would be nice if the harvest routine could block for 5 seconds,
> and if the prolog is not complete in that time, mark job as prolog
> @@ -446,9 +446,8 @@ int run_pelog(
>
> alarm(0);
>
> - act.sa_handler = SIG_DFL;
> -
> - sigaction(SIGALRM,&act,0);
> + /* restore the previous handler */
> + sigaction(SIGALRM,&oldact,0);
>
> if (run_exit == 0)
> {
--
Garrick Staples, Linux/HPCC Administrator
University of Southern California