[Mauiusers] Suspended jobs not being resumed

Edgar Leon edgar at mathcs.emory.edu
Wed Apr 16 16:01:59 MDT 2008


Ronny,

> I seem to vaguely remember a problem I had a while ago: suspend jobs 
> would not age and as such increase their priority again.

I enabled aging (USAGEEXECUTIONTIMEWEIGHT = 1), and most of the suspended
jobs now age and resume.

However, there are four suspended jobs that do not age. They also do
not resume, and they have been in this state for 6 hours:

Job id              Name             User            Time Use S Queue
8229.head           job0328          eleon           00:01:21 S batch2
8233.head           job0328          eleon           00:01:55 S batch2
8240.head           job0328          eleon           00:01:57 S batch2
8242.head           job0328          eleon           00:01:58 S batch2
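
A listing like the one above can be filtered for suspended jobs with a
few lines of Python. This is only a minimal sketch: it assumes the
six-column row layout shown in this message (id, name, user, time used,
state, queue), and the function name suspended_jobs is made up for
illustration.

```python
def suspended_jobs(qstat_text):
    """Return (job_id, queue) pairs for rows whose state column is 'S'."""
    found = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Assumption: a job row has exactly six whitespace-separated
        # columns: id, name, user, time used, state, queue.
        # The header row splits into more fields and is skipped.
        if len(fields) == 6 and fields[4] == "S":
            found.append((fields[0], fields[5]))
    return found


sample = """\
Job id              Name             User            Time Use S Queue
8229.head           job0328          eleon           00:01:21 S batch2
8233.head           job0328          eleon           00:01:55 S batch2
"""

for job_id, queue in suspended_jobs(sample):
    print(job_id, queue)
```

The job ids it returns can then be passed to checkjob one at a time.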

checkjob shows that State and EState do not match for these four jobs:

checking job 8229

State: Suspended  EState: Running
Creds:  user:eleon  group:guest  class:batch2  qos:low
WallTime: 1:10:44 of 99:23:59:59
Suspended Wall Time: 5:05:29
SubmitTime: Wed Apr 16 11:42:11
   (Time Queued  Total: 6:04:42  Eligible: 00:30:05)

StartDate: -5:35:00  Wed Apr 16 12:11:53
Total Tasks: 1

Req[0]  TaskCount: 1  Partition: DEFAULT
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
NodeCount: 1
Allocated Nodes:
[node026:1]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags:       RESTARTABLE PREEMPTEE
Attr:        PREEMPTEE

EState 'Running' does not match current state 'Suspended'
Reservation '8229' (-6:04:38 -> 99:17:55:21  Duration: 99:23:59:59)
PE:  1.00  StartPriority:  40
cannot select job 8229 for partition DEFAULT (non-idle expected state 'Running')

----------------------------------------------------------------------

Any ideas on how to get a job out of this state (other than restarting
maui), or on what causes this condition?

For comparison, here is the checkjob output of a suspended job that
does age and resume:

checking job 8210

State: Suspended
Creds:  user:eleon  group:guest  class:batch2  qos:low
WallTime: 6:02:57 of 99:23:59:59
Suspended Wall Time: 00:00:30
SubmitTime: Wed Apr 16 11:41:52
   (Time Queued  Total: 5:52:35  Eligible: 00:37:53)

StartDate: 00:00:10  Wed Apr 16 17:34:37
Total Tasks: 1

Req[0]  TaskCount: 1  Partition: DEFAULT
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
NodeCount: 1
Allocated Nodes:
[node012:1]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags:       RESTARTABLE PREEMPTEE
Attr:        PREEMPTEE

PE:  1.00  StartPriority:  47
cannot select job 8210 for partition DEFAULT (startdate in '00:00:10')

-------------------------------------------------------------------------
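
The visible difference between the two checkjob outputs above is the
extra EState field on the State line of the stuck job. A minimal Python
sketch for flagging that condition follows. It assumes the output format
is exactly as pasted in this message, and the helper name state_mismatch
is hypothetical, not part of Maui.

```python
import re

# Matches "State: X" optionally followed on the same line by "EState: Y".
# Assumption: the State line looks like the samples in this message.
STATE_RE = re.compile(r"^State:\s*(\S+)(?:\s+EState:\s*(\S+))?", re.MULTILINE)


def state_mismatch(checkjob_text):
    """Return (state, estate) if an EState is reported and differs, else None."""
    m = STATE_RE.search(checkjob_text)
    if m is None:
        return None
    state, estate = m.group(1), m.group(2)
    if estate is not None and estate != state:
        return (state, estate)
    return None


stuck = "checking job 8229\n\nState: Suspended  EState: Running\nCreds:  user:eleon\n"
healthy = "checking job 8210\n\nState: Suspended\nCreds:  user:eleon\n"

print(state_mismatch(stuck))
print(state_mismatch(healthy))
```

This only detects the mismatch; it does not explain or repair it.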

Thanks.

Edgar


> To work-around this you will have to change your config as detailed in 
> (here you can find my original problem report)
> 
> http://osdir.com/ml/clustering.maui.user/2006-08/msg00021.html



