[torquedev] pbs_moms dying from SIGALRM

Garrick Staples garrick at usc.edu
Thu Jan 12 11:21:11 MST 2006


On Wed, Jan 11, 2006 at 05:12:40PM -0700, Marcus R. Epperson alleged:
> We have been seeing pbs_moms die somewhat regularly on our machines and
> have tracked down the apparent cause.  It seems that under heavy load, some
> of the alarm() calls in is_update_stat() were expiring, and since run_pelog()
> had reset the SIGALRM handler to SIG_DFL (whose default action is to
> terminate the process), this was fatal to the mom (and, subsequently, the
> job).
> 
> The attached patch fixes this by making run_pelog() restore the previous 
> handler rather than forcing it to be SIG_DFL.  With this patch applied, the 
> mom runs the original SIGALRM handler (toolong()) rather than dying.  This 
> handler doesn't have any real effect, other than informing you that some 
> action took longer than expected, but that seems to be more correct than 
> the alternative.
> 
> To simulate the failure:
> - Watch the mom_logs while you run the following on a node that's not part 
> of a job:
>    # killall -ALRM pbs_mom
>  You should see a log message similar to this:
>    01/11/2006 15:53:51;0002;   pbs_mom;n/a;toolong;alarm call
> - Then launch a job on the node, wait for the JOIN JOB message, and then 
> run the killall command again.  On our systems this was killing the mom 
> process immediately.
> 
> -- 
> --------------------------------------------------
> Marcus R. Epperson -- SNL
> --------------------------------------------------
> --

Checked in your patch.  Thanks!

We should also make sure that every function that sets an alarm always
sets the appropriate handler.  I think the primary offender here is
is_update_stat().


> --- src/resmom/prolog.c-orig	2006-01-11 15:44:10.000000000 -0700
> +++ src/resmom/prolog.c	2006-01-11 15:45:36.000000000 -0700
> @@ -306,7 +306,7 @@ int run_pelog(
>    {
>    char *id = "run_pelog";
>  
> -  struct sigaction act;
> +  struct sigaction act, oldact;
>    char	        *arg[11];
>    int		 fds1 = 0;
>    int		 fds2 = 0;
> @@ -388,7 +388,7 @@ int run_pelog(
>  
>      act.sa_flags = 0;
>  
> -    sigaction(SIGALRM,&act,0);
> +    sigaction(SIGALRM,&act,&oldact);
>  
>      /* it would be nice if the harvest routine could block for 5 seconds, 
>         and if the prolog is not complete in that time, mark job as prolog 
> @@ -446,9 +446,8 @@ int run_pelog(
>  
>      alarm(0);
>  
> -    act.sa_handler = SIG_DFL;
> -
> -    sigaction(SIGALRM,&act,0);
> +    /* restore the previous handler */
> +    sigaction(SIGALRM,&oldact,0);
>  
>      if (run_exit == 0) 
>        {



-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California