[torqueusers] Torque not deleting job
Adam Emerich
aemerich at us.ibm.com
Thu Apr 19 10:41:59 MDT 2007
I am seeing a case in which torque does not delete an interactive job if
the node on which the job is running goes down. Here is what I am doing:
    qsub -I -l nodes=n01-01-06:ppn=1

This successfully returns a prompt on the requested machine.
Then the node (n01-01-06) is rebooted. After the reboot, "top" on n01-01-06
does not show any jobs being run by my userid. However, "showq" shows the
following on the torque server:
ACTIVE JOBS--------------------
JOBNAME   USERNAME   STATE    PROC  REMAINING  STARTTIME
1119      cengel     Running     8   00:46:11  Thu Apr 19 03:45:19
1115      cengel     Running     4    1:38:53  Thu Apr 19 04:38:01
1123      cengel     Running     8    1:39:42  Thu Apr 19 04:38:50
1124      cengel     Running     8    1:39:49  Thu Apr 19 04:38:57
1125      cengel     Running     8    4:54:01  Thu Apr 19 07:53:09
1118      cengel     Running     4    4:56:09  Thu Apr 19 07:55:17
1131      aemerich   Running     1   11:53:54  Thu Apr 19 08:53:02

7 Active Jobs    41 of 48 Processors Active (85.42%)
                 7 of 7 Nodes Active (100.00%)

IDLE JOBS----------------------
JOBNAME   USERNAME   STATE    PROC    WCLIMIT  QUEUETIME
1135      aemerich   Idle        8   12:00:00  Thu Apr 19 08:59:07
1121      cengel     Idle        4    6:00:00  Thu Apr 19 01:57:09
1126      cengel     Idle        4    6:00:00  Thu Apr 19 04:47:11
1129      cengel     Idle        4    6:00:00  Thu Apr 19 07:57:13
1132      cengel     Idle        8    6:00:00  Thu Apr 19 08:57:14
1133      cengel     Idle        8    6:00:00  Thu Apr 19 08:57:14
1134      cengel     Idle        8    6:00:00  Thu Apr 19 08:57:14

7 Idle Jobs

BLOCKED JOBS----------------
JOBNAME   USERNAME   STATE    PROC    WCLIMIT  QUEUETIME

Total Jobs: 14   Active Jobs: 7   Idle Jobs: 7   Blocked Jobs: 0

A "checkjob" on job 1131 above shows the following:
checking job 1131
State: Running
Creds: user:aemerich group:users class:dque qos:dque
WallTime: 00:05:13 of 12:00:00
SubmitTime: Thu Apr 19 08:53:01
(Time Queued Total: 00:00:01 Eligible: 00:00:01)
StartTime: Thu Apr 19 08:53:02
Total Tasks: 1
Req[0] TaskCount: 1 Partition: DEFAULT
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
NodeCount: 1
Allocated Nodes:
[n01-01-06:1]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
Flags: HOSTLIST PREEMPTOR
HostList:
[n01-01-06:1]
Reservation '1131' (-00:05:15 -> 11:54:45 Duration: 12:00:00)
PE: 1.00 StartPriority: 1000
As you can see the server still thinks there is a job running, but upon the
reboot of the compute node the interactive job returned a message that
said:
qsub: job 1131.rrmaster completed
The end result is that if a user submits a job that requires all the
processors after the reboot, that job is held up waiting for a fictitious
job to complete. In our case, the walltime limit is 12 hours on interactive
jobs, so the new submission may wait up to 12 hours before being started
on the compute node.
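To make the symptom easier to spot, here is a rough sketch of how the "checkjob" output above could be scanned for jobs that are still marked Running on a node known to have been rebooted. It only parses text in the format shown in this message. The names parse_checkjob, is_stale, and the rebooted_nodes set are my own illustrative assumptions, not part of Torque or Maui, and the list of rebooted nodes would have to be supplied by hand.

```python
import re

# Excerpt of the checkjob output quoted in this message.
SAMPLE = """\
checking job 1131
State: Running
Total Tasks: 1
Allocated Nodes:
[n01-01-06:1]
IWD: [NONE] Executable: [NONE]
HostList:
[n01-01-06:1]
"""

# Matches entries such as "[n01-01-06:1]" but not "[NONE]".
NODE_ENTRY = re.compile(r"\[([^\]:\s]+):\d+\]")
HEADER = "Allocated Nodes:"


def parse_checkjob(text):
    """Return the job id, state, and allocated node names found in the text."""
    job_id = re.search(r"^checking job (\S+)", text, re.MULTILINE)
    state = re.search(r"^State:\s*(\S+)", text, re.MULTILINE)
    nodes = []
    in_alloc = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(HEADER):
            # Entries may follow on the same line or on the next lines.
            in_alloc = True
            nodes.extend(NODE_ENTRY.findall(line[len(HEADER):]))
            continue
        if in_alloc:
            found = NODE_ENTRY.findall(line)
            if found:
                nodes.extend(found)
            else:
                in_alloc = False  # first line without node entries ends the list
    return {
        "id": job_id.group(1) if job_id else None,
        "state": state.group(1) if state else None,
        "nodes": nodes,
    }


def is_stale(job, rebooted_nodes):
    """A job is suspect if it is Running on a node that was rebooted."""
    return job["state"] == "Running" and any(
        node in rebooted_nodes for node in job["nodes"]
    )


if __name__ == "__main__":
    job = parse_checkjob(SAMPLE)
    # rebooted_nodes is a hand-maintained assumption, not queried from Torque.
    print(job["id"], "stale:", is_stale(job, {"n01-01-06"}))
```

This only flags the job for a closer look; it does not remove anything from the server.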
Versions of Torque and Maui being used:
Torque 2.1.8
Maui 3.2.6p19
Any input would be greatly appreciated.
Adam Emerich