[torqueusers] Torque not deleting job

Adam Emerich aemerich at us.ibm.com
Thu Apr 19 10:41:59 MDT 2007


I am seeing a case in which Torque does not delete an interactive job when
the node on which the job is running goes down.  Here is what I am doing:

   qsub -I -l nodes=n01-01-06:ppn=1    -> successfully returns a prompt on
                                          the machine requested

Then the node (n01-01-06) is rebooted.  After the reboot, "top" on
n01-01-06 does not show any processes owned by my userid.  However, "showq"
on the Torque server shows the following:

   ACTIVE JOBS--------------------
   JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

   1119                 cengel    Running     8    00:46:11  Thu Apr 19 03:45:19
   1115                 cengel    Running     4     1:38:53  Thu Apr 19 04:38:01
   1123                 cengel    Running     8     1:39:42  Thu Apr 19 04:38:50
   1124                 cengel    Running     8     1:39:49  Thu Apr 19 04:38:57
   1125                 cengel    Running     8     4:54:01  Thu Apr 19 07:53:09
   1118                 cengel    Running     4     4:56:09  Thu Apr 19 07:55:17
   1131               aemerich    Running     1    11:53:54  Thu Apr 19 08:53:02

        7 Active Jobs      41 of   48 Processors Active (85.42%)
                            7 of    7 Nodes Active      (100.00%)

   IDLE JOBS----------------------
   JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

   1135               aemerich       Idle     8    12:00:00  Thu Apr 19 08:59:07
   1121                 cengel       Idle     4     6:00:00  Thu Apr 19 01:57:09
   1126                 cengel       Idle     4     6:00:00  Thu Apr 19 04:47:11
   1129                 cengel       Idle     4     6:00:00  Thu Apr 19 07:57:13
   1132                 cengel       Idle     8     6:00:00  Thu Apr 19 08:57:14
   1133                 cengel       Idle     8     6:00:00  Thu Apr 19 08:57:14
   1134                 cengel       Idle     8     6:00:00  Thu Apr 19 08:57:14

   7 Idle Jobs

   BLOCKED JOBS----------------
   JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


   Total Jobs: 14   Active Jobs: 7   Idle Jobs: 7   Blocked Jobs: 0
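
For completeness, the server-side state of the node itself can be checked
with pbsnodes (part of Torque); passing the node name limits the report to
that host:

   pbsnodes n01-01-06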

A "checkjob" on job 1131 above shows the following:

   checking job 1131

   State: Running
   Creds:  user:aemerich  group:users  class:dque  qos:dque
   WallTime: 00:05:13 of 12:00:00
   SubmitTime: Thu Apr 19 08:53:01
     (Time Queued  Total: 00:00:01  Eligible: 00:00:01)

   StartTime: Thu Apr 19 08:53:02
   Total Tasks: 1

   Req[0]  TaskCount: 1  Partition: DEFAULT
   Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
   Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
   NodeCount: 1
   Allocated Nodes:
   [n01-01-06:1]


   IWD: [NONE]  Executable:  [NONE]
   Bypass: 0  StartCount: 1
   PartitionMask: [ALL]
   Flags:       HOSTLIST PREEMPTOR
   HostList:
     [n01-01-06:1]
   Reservation '1131' (-00:05:15 -> 11:54:45  Duration: 12:00:00)
   PE:  1.00  StartPriority:  1000
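
To see what the mom on the rebooted node believes it is running, its state
can be queried directly with momctl's diagnose mode (momctl ships with
Torque; -d sets the diagnose verbosity and -h the host to query):

   momctl -d 3 -h n01-01-06

If the mom genuinely lost the job across the reboot (consistent with what
"top" shows), its diagnostics should list no active jobs for it.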

As you can see, the server still thinks there is a job running, even though
upon the reboot of the compute node the interactive session exited with the
message:

   qsub: job 1131.rrmaster completed

The end result is that if, after the reboot, a user submits a job that
requires all the processors, that job will be held up waiting for a
fictitious job to complete.  In our case interactive jobs have a 12-hour
walltime, so the new submission may wait up to 12 hours before being
dispatched to the compute node.
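
A possible manual workaround is to force-purge the stale job with qdel's -p
(purge) option, which removes the job from the server's records without
waiting for an acknowledgement from the mom (manager or operator privileges
are required):

   qdel -p 1131

That would free the processors immediately, but it obviously does not fix
the underlying failure to detect the dead job.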

Versions of Torque and Maui being used:
Torque 2.1.8
Maui 3.2.6p19

Any input would be greatly appreciated.

Adam Emerich


