[torqueusers] [RESENT]: endless 'requeuing delete request's in 1.2.0p6

Wolfgang Wander wwc at rentec.com
Thu Sep 15 08:23:38 MDT 2005


Hi,

   If a job is in PRERUN state when the mom dies or shuts down, the
server ends up endlessly busy trying to delete the job after the
first qdel attempt.  Each new qdel just makes the server busier:

09/14/2005 09:33:26;0008;PBS_Server;Job;1283.da-server;job cannot be deleted, state = PRERUN, requeuing delete request
09/14/2005 09:33:26;0008;PBS_Server;Job;1283.da-server;job cannot be deleted, state = PRERUN, requeuing delete request
09/14/2005 09:33:27;0008;PBS_Server;Job;1283.da-server;job cannot be deleted, state = PRERUN, requeuing delete request
09/14/2005 09:33:27;0008;PBS_Server;Job;1283.da-server;job cannot be deleted, state = PRERUN, requeuing delete request
09/14/2005 09:33:27;0008;PBS_Server;Job;1283.da-server;job cannot be deleted, state = PRERUN, requeuing delete request
09/14/2005 09:33:28;0008;PBS_Server;Job;1283.da-server;job cannot be deleted, state = PRERUN, requeuing delete request
09/14/2005 09:33:28;0008;PBS_Server;Job;1283.da-server;job cannot be deleted, state = PRERUN, requeuing delete request
09/14/2005 09:33:28;0008;PBS_Server;Job;1283.da-server;job cannot be deleted, state = PRERUN, requeuing delete request
09/14/2005 09:33:29;0008;PBS_Server;Job;1283.da-server;job cannot be deleted, state = PRERUN, requeuing delete request
09/14/2005 09:33:29;0008;PBS_Server;Job;1283.da-server;job cannot be deleted, state = PRERUN, requeuing delete request
09/14/2005 09:33:29;0008;PBS_Server;Job;1283.da-server;job cannot be deleted, state = PRERUN, requeuing delete request
[...]

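The patch at the end of this mail adds some state to the PRERUN delete
path so that the requeued delete requests eventually drain: once the
same job has been stuck in PRERUN for more than 10 seconds, the delete
goes ahead anyway.  There may be better solutions, but some way to end
the loop is certainly required.  With the patch applied, the log looks
like this: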

09/14/2005 09:37:11;0008;PBS_Server;Job;1283.da-server;job cannot be deleted, state = PRERUN, requeuing delete request
09/14/2005 09:37:12;0008;PBS_Server;Job;1283.da-server;job cannot be deleted, state = PRERUN, requeuing delete request
09/14/2005 09:37:13;0008;PBS_Server;Job;1283.da-server;job cannot be deleted, state = PRERUN, requeuing delete request
09/14/2005 09:37:14;0008;PBS_Server;Job;1283.da-server;job cannot be deleted, state = PRERUN, requeuing delete request
09/14/2005 09:37:15;0008;PBS_Server;Job;1283.da-server;job cannot be deleted, state = PRERUN, requeuing delete request
09/14/2005 09:37:16;0008;PBS_Server;Job;1283.da-server;job cannot be deleted, state = PRERUN, requeuing delete request
09/14/2005 09:37:17;0008;PBS_Server;Job;1283.da-server;job cannot be deleted, state = PRERUN, requeuing delete request
09/14/2005 09:37:18;0008;PBS_Server;Job;1283.da-server;job cannot be deleted, state = PRERUN, requeuing delete request
09/14/2005 09:37:19;0008;PBS_Server;Job;1283.da-server;job cannot be deleted, state = PRERUN, requeuing delete request
09/14/2005 09:37:20;0008;PBS_Server;Job;1283.da-server;job cannot be deleted, state = PRERUN, requeuing delete request
09/14/2005 09:37:21;0008;PBS_Server;Job;1283.da-server;job cannot be deleted, state = PRERUN, requeuing delete request
09/14/2005 09:37:22;0008;PBS_Server;Job;1283.da-server;Job deleted at request of root at da-client
09/14/2005 09:37:22;0008;PBS_Server;Job;1283.da-server;Job sent signal SIGTERM on delete
09/14/2005 09:37:22;0008;PBS_Server;Job;1283.da-server;MOM rejected signal during delete


*** src/server/req_delete.c.orig	Wed Sep 14 09:35:01 2005
--- src/server/req_delete.c	Wed Sep 14 09:35:11 2005
***************
*** 283,288 ****
--- 283,307 ----
      /* being sent to MOM, wait till she gets it going */
      /* retry in one second				  */
  
+     static time_t cycle_check_when = 0;
+     static int    cycle_check_id   = -1;
+ 
+     if( cycle_check_when ) {
+       if( pjob->ji_qs.ji_jobid == cycle_check_id && time_now - cycle_check_when > 10 ) {
+         /* did the mom ever get it? delete it anyways... */
+         cycle_check_id = -1;
+         cycle_check_when = 0;
+         goto jump;
+       } else if( time_now - cycle_check_when > 20 ) {
+         cycle_check_id = -1;
+         cycle_check_when = 0;
+       }
+     } 
+     if( !cycle_check_when ) {
+       cycle_check_when = time_now;
+       cycle_check_id = pjob->ji_qs.ji_jobid;
+     }
+ 
      sprintf(log_buffer,"job cannot be deleted, state = PRERUN, requeuing delete request");
        
      log_event(
***************
*** 303,308 ****
--- 322,329 ----
      return;
      }
  
+ jump:
+ 
    /*
     * Log delete and if requesting client is not job owner, send mail.
     */
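For anyone who wants to try the give-up logic outside of pbs_server,
here is a minimal standalone sketch of the same idea.  It is not the
patch itself: the names prerun_tracker, should_force_delete,
requeues_before_force and JOBID_MAX are made up for illustration, the
state lives in a struct instead of function-local statics so it can be
tested, and it assumes job IDs are character strings, so it compares
them with strcmp instead of keeping the ID in an int.

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

#define JOBID_MAX 64   /* hypothetical size, not the real TORQUE limit */

/* State kept between delete attempts (the patch uses two statics). */
struct prerun_tracker {
    time_t when;               /* time tracking started; 0 means idle */
    char   jobid[JOBID_MAX];   /* job currently being tracked */
};

/*
 * Returns 1 if the delete should go ahead even though the job is
 * still in PRERUN, 0 if the delete request should be requeued.
 */
static int should_force_delete(struct prerun_tracker *t,
                               const char *jobid, time_t now)
{
    if (t->when != 0) {
        if (strcmp(jobid, t->jobid) == 0 && now - t->when > 10) {
            /* same job stuck for more than 10 s: stop retrying */
            t->when = 0;
            t->jobid[0] = '\0';
            return 1;
        } else if (now - t->when > 20) {
            /* tracked entry is stale: free the slot */
            t->when = 0;
            t->jobid[0] = '\0';
        }
    }
    if (t->when == 0) {
        /* start tracking the job of this request */
        t->when = now;
        snprintf(t->jobid, sizeof t->jobid, "%s", jobid);
    }
    return 0;
}

/*
 * Simulates one retry per second, as the server does, and returns
 * how many requests get requeued before the delete is forced.
 * start must be non-zero because 0 marks an idle tracker.
 */
static int requeues_before_force(const char *jobid, time_t start)
{
    struct prerun_tracker t = {0, ""};
    int requeued = 0;

    while (!should_force_delete(&t, jobid, start + requeued))
        requeued++;
    return requeued;
}
```

With one retry per second, requeues_before_force() comes out at 11,
which matches the eleven "requeuing delete request" lines between
09:37:11 and 09:37:21 in the log above, followed by the actual delete
at 09:37:22.  Note that only one job is tracked at a time, so a qdel
for a second job stuck in PRERUN has to wait until the slot is freed
or goes stale after 20 seconds.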

