[Mauiusers] Suspended jobs not being resumed
Edgar Leon
edgar at mathcs.emory.edu
Wed Apr 16 16:01:59 MDT 2008
Ronny,
> I seem to vaguely remember a problem I had a while ago: suspended jobs
> would not age, and so their priority would not increase again.
I enabled aging (USAGEEXECUTIONTIMEWEIGHT = 1), and most of the suspended
jobs now age and resume.
However, four suspended jobs do not age. They also do not resume, and
they have been in this state for 6 hours:
Job id       Name      User    Time Use  S  Queue
8229.head    job0328   eleon   00:01:21  S  batch2
8233.head    job0328   eleon   00:01:55  S  batch2
8240.head    job0328   eleon   00:01:57  S  batch2
8242.head    job0328   eleon   00:01:58  S  batch2
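For anyone who wants to pick such jobs out of a longer listing, here is a
minimal sketch. It assumes the six-column layout shown above (id, name,
user, time used, state, queue); the function name suspended_jobs is
hypothetical and not part of TORQUE or Maui.

```python
def suspended_jobs(qstat_text):
    """Return the ids of jobs whose state column is 'S' (suspended).

    Assumes each job line has exactly six whitespace-separated fields:
    id, name, user, time used, state, queue. Header lines and anything
    else that does not fit that shape are skipped.
    """
    jobs = []
    for line in qstat_text.splitlines():
        fields = line.split()
        if len(fields) == 6 and fields[4] == "S":
            jobs.append(fields[0])
    return jobs
```

Feeding it the listing above as a string should return the four job ids;
other layouts (for example, qstat -a) would need different column indices.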
checkjob shows that State and EState do not match for these four jobs:
checking job 8229
State: Suspended EState: Running
Creds: user:eleon group:guest class:batch2 qos:low
WallTime: 1:10:44 of 99:23:59:59
Suspended Wall Time: 5:05:29
SubmitTime: Wed Apr 16 11:42:11
(Time Queued Total: 6:04:42 Eligible: 00:30:05)
StartDate: -5:35:00 Wed Apr 16 12:11:53
Total Tasks: 1
Req[0] TaskCount: 1 Partition: DEFAULT
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
NodeCount: 1
Allocated Nodes:
[node026:1]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
Flags: RESTARTABLE PREEMPTEE
Attr: PREEMPTEE
EState 'Running' does not match current state 'Suspended'
Reservation '8229' (-6:04:38 -> 99:17:55:21 Duration: 99:23:59:59)
PE: 1.00 StartPriority: 40
cannot select job 8229 for partition DEFAULT (non-idle expected state
'Running')
----------------------------------------------------------------------
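To spot this condition across many jobs, the checkjob text can be scanned
for a State/EState pair that differs. This is only a sketch: it assumes
the "State: ... EState: ..." line format seen in the output above, and
state_mismatch is a hypothetical helper name, not a Maui tool.

```python
import re

# Matches a line such as "State: Suspended EState: Running".
# The EState part is optional because checkjob omits it in the
# healthy example later in this message.
_STATE_RE = re.compile(
    r"^[ \t]*State:[ \t]*(\S+)(?:[ \t]+EState:[ \t]*(\S+))?",
    re.MULTILINE,
)

def state_mismatch(checkjob_text):
    """Return (state, estate) if the two differ, otherwise None."""
    m = _STATE_RE.search(checkjob_text)
    if m is None:
        return None
    state, estate = m.group(1), m.group(2)
    if estate is not None and estate != state:
        return state, estate
    return None
```

The input would be the captured text of a checkjob run for one job; the
function only reports the mismatch and does nothing to clear it.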
Any ideas on how to get a job out of this state (other than restarting
Maui), or on what causes this condition?
For comparison purposes, here is the checkjob output of a suspended
job that ages and resumes:
checking job 8210
State: Suspended
Creds: user:eleon group:guest class:batch2 qos:low
WallTime: 6:02:57 of 99:23:59:59
Suspended Wall Time: 00:00:30
SubmitTime: Wed Apr 16 11:41:52
(Time Queued Total: 5:52:35 Eligible: 00:37:53)
StartDate: 00:00:10 Wed Apr 16 17:34:37
Total Tasks: 1
Req[0] TaskCount: 1 Partition: DEFAULT
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
NodeCount: 1
Allocated Nodes:
[node012:1]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
Flags: RESTARTABLE PREEMPTEE
Attr: PREEMPTEE
PE: 1.00 StartPriority: 47
cannot select job 8210 for partition DEFAULT (startdate in '00:00:10')
-------------------------------------------------------------------------
Thanks.
Edgar
> To work-around this you will have to change your config as detailed in
> (here you can find my original problem report)
>
> http://osdir.com/ml/clustering.maui.user/2006-08/msg00021.html