[torquedev] pbs_moms dying from SIGALRM

Marcus R. Epperson mrepper at sandia.gov
Wed Jan 11 17:12:40 MST 2006


We have been seeing pbs_moms die somewhat regularly on our machines and have tracked down the apparent cause.  It seems that under heavy load, some of the alarm calls in is_update_stat() were expiring, and since run_pelog() had reset the SIGALRM handler to SIG_DFL (whose default action is to terminate the process), this was fatal to the mom (and subsequently to the job).

The attached patch fixes this by making run_pelog() restore the previous handler rather than forcing it to be SIG_DFL.  With this patch applied, the mom runs the original SIGALRM handler (toolong()) rather than dying.  This handler has no real effect beyond logging that some action took longer than expected, but that seems more correct than the alternative.

To simulate the failure:
- Watch the mom_logs while you run the following on a node that's not part of a job:
    # killall -ALRM pbs_mom
  You should see a log message similar to this:
    01/11/2006 15:53:51;0002;   pbs_mom;n/a;toolong;alarm call
- Then launch a job on the node, wait for the JOIN JOB message, and then run the killall command again.  On our systems this was killing the mom process immediately.

-- 
--------------------------------------------------
Marcus R. Epperson -- SNL
--------------------------------------------------
--
-------------- next part --------------
--- src/resmom/prolog.c-orig	2006-01-11 15:44:10.000000000 -0700
+++ src/resmom/prolog.c	2006-01-11 15:45:36.000000000 -0700
@@ -306,7 +306,7 @@ int run_pelog(
   {
   char *id = "run_pelog";
 
-  struct sigaction act;
+  struct sigaction act, oldact;
   char	        *arg[11];
   int		 fds1 = 0;
   int		 fds2 = 0;
@@ -388,7 +388,7 @@ int run_pelog(
 
     act.sa_flags = 0;
 
-    sigaction(SIGALRM,&act,0);
+    sigaction(SIGALRM,&act,&oldact);
 
     /* it would be nice if the harvest routine could block for 5 seconds, 
        and if the prolog is not complete in that time, mark job as prolog 
@@ -446,9 +446,8 @@ int run_pelog(
 
     alarm(0);
 
-    act.sa_handler = SIG_DFL;
-
-    sigaction(SIGALRM,&act,0);
+    /* restore the previous handler */
+    sigaction(SIGALRM,&oldact,0);
 
     if (run_exit == 0) 
       {

