[Mauiusers] Suspended jobs not being resumed

Edgar Leon edgar at mathcs.emory.edu
Thu Apr 10 14:57:17 MDT 2008


Could someone please help me resolve a problem where suspended jobs
are not being resumed?

checkjob shows the following for a suspended job:
State: Suspended  EState: Running
EState 'Running' does not match current state 'Suspended'
cannot select job 3340 for partition DEFAULT (non-idle expected state 'Running')


Here are the details:

My batch system has two queues: default (high priority) and batch2
(low priority). Low-priority jobs are preempted by suspension and are
meant to be resumed later.

I submitted 96 jobs to the batch2 queue.  Most of them finished and
two are still running.

Many of these jobs were suspended when high-priority jobs were
submitted to the default queue, and were later resumed.

However, 10 of the suspended jobs were never resumed, and they have
been in this state for many hours:

% qstat
Job id              Name             User            Time Use S Queue
------------------- ---------------- --------------- -------- - -----
3304.head           job0328          eleon           00:04:46 S batch2
3311.head           job0328          eleon           00:02:19 S batch2
3335.head           job0328          eleon           00:02:19 S batch2
3336.head           job0328          eleon           00:02:22 S batch2
3340.head           job0328          eleon           00:51:01 S batch2
3345.head           job0328          eleon           00:02:17 S batch2
3346.head           job0328          eleon           00:02:13 S batch2
3371.head           job0328          eleon           00:02:30 S batch2
3372.head           job0328          eleon           00:51:47 S batch2
3373.head           job0328          eleon           00:07:35 S batch2
3374.head           job0328          eleon           06:01:17 R batch2
3377.head           job0328          eleon           06:05:41 R batch2
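
To pull the stuck job IDs out of a listing like the one above, a minimal
Python sketch follows. It assumes the six-column qstat layout shown here;
the function name `suspended_jobs` is hypothetical and not part of Torque
or Maui.

```python
def suspended_jobs(qstat_text):
    """Return the job IDs whose state column is 'S' (suspended)."""
    jobs = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Data rows are assumed to look like:
        #   <job id> <name> <user> <time use> <state> <queue>
        # The header splits into 8 fields and is skipped by the length check;
        # the dashed separator has '-' in the state column and is skipped too.
        if len(fields) == 6 and fields[4] == "S":
            jobs.append(fields[0])
    return jobs


sample = """\
Job id              Name             User            Time Use S Queue
------------------- ---------------- --------------- -------- - -----
3340.head           job0328          eleon           00:51:01 S batch2
3374.head           job0328          eleon           06:01:17 R batch2
"""

print(suspended_jobs(sample))  # ['3340.head']
```

The resulting IDs could then be passed to checkjob one at a time.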

------------------------------------------------------------------

# /usr/local/maui/bin/checkjob 3340


checking job 3340

State: Suspended  EState: Running
Creds:  user:eleon  group:guest  class:batch2  qos:low
WallTime: 8:30:09 of 99:23:59:59
Suspended Wall Time: 16:08:44
SubmitTime: Wed Apr  9 11:16:57
   (Time Queued  Total: 1:01:08:09  Eligible: 8:41:01)

StartDate: -16:37:40  Wed Apr  9 19:47:26
Total Tasks: 1

Req[0]  TaskCount: 1  Partition: DEFAULT
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
NodeCount: 1
Allocated Nodes:
[node009:1]
WARNING:  allocated node          node009 is in state Idle


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags:       RESTARTABLE PREEMPTEE
Attr:        PREEMPTEE

EState 'Running' does not match current state 'Suspended'
Reservation '3340' (-1:01:08:07 -> 98:22:51:52  Duration: 99:23:59:59)
PE:  1.00  StartPriority:  531
cannot select job 3340 for partition DEFAULT (non-idle expected state 'Running')
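
The symptom in this output is the disagreement between State and EState.
As a rough sketch, assuming checkjob prints both on one line as it does
above, the mismatch could be detected in Python like this (the function
name `state_mismatch` is hypothetical, not a Maui tool):

```python
import re


def state_mismatch(checkjob_text):
    """Return (state, estate) if they differ, otherwise None."""
    m = re.search(r"^State:\s*(\S+)\s+EState:\s*(\S+)",
                  checkjob_text, re.MULTILINE)
    if m is None:
        return None  # no State/EState line found in the text
    state, estate = m.groups()
    return (state, estate) if state != estate else None


sample = """\
checking job 3340

State: Suspended  EState: Running
Creds:  user:eleon  group:guest  class:batch2  qos:low
"""

print(state_mismatch(sample))  # ('Suspended', 'Running')
```

This only reports the inconsistency; it does not explain why the
scheduler still expects the job to be running.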

----------------------------------------------------------------------

The MAUI log shows:

04/10 14:33:56 INFO:     attribute 'PREEMPTEE' set for job 3336
04/10 14:33:56 MPBSJobUpdate(3340,3340.head,TaskList,0)
04/10 14:33:56 INFO:     attribute 'PREEMPTEE' set for job 3340

04/10 14:33:56 INFO:     job '3340' Priority:      531
04/10 14:33:56 INFO:     Cred:     10(00.0)  FS:      0(00.0)  Attr:      0(00.0)  Serv:    521(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:      0(00.0)

-------------------------------------------------------------------

Versions:  maui-3.2.6p18 and torque-2.1.6

Here is the maui.cfg file:

# maui.cfg 3.2p8

SERVERHOST            head
ADMIN1                root
RMCFG[HEAD] TYPE=PBS
AMCFG[bank]  TYPE=NONE
RMPOLLINTERVAL        00:00:30

SERVERPORT            42559
SERVERMODE            NORMAL

LOGFILE               maui.log
LOGFILEMAXSIZE        300000000
LOGLEVEL              3

PREEMPTPOLICY  SUSPEND
RESERVATIONPOLICY     NEVER

QOSWEIGHT          1

CLASSCFG[default] QDEF=high
CLASSCFG[batch2]    QDEF=low

QOSCFG[high]       PRIORITY=200000
QOSCFG[high]       QFLAGS=PREEMPTOR
QOSCFG[low]        PRIORITY=10
QOSCFG[low]        QFLAGS=PREEMPTEE
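
For reference, the class-to-QoS mapping in this file can be traced
mechanically. The following is a minimal Python sketch that handles only
the CLASSCFG[...] QDEF= and QOSCFG[...] QFLAGS= lines shown above; it is
not a general maui.cfg parser, and `class_preemption_roles` is a
hypothetical name.

```python
import re


def class_preemption_roles(cfg_text):
    """Map each class to (default QoS, QFLAGS of that QoS)."""
    class_qos = {}
    qos_flags = {}
    for raw in cfg_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        m = re.match(r"(CLASSCFG|QOSCFG)\[(\w+)\]\s+(.*)", line)
        if not m:
            continue
        kind, name, rest = m.groups()
        # Collect KEY=VALUE pairs from the remainder of the line.
        attrs = dict(p.split("=", 1) for p in rest.split() if "=" in p)
        if kind == "CLASSCFG" and "QDEF" in attrs:
            class_qos[name] = attrs["QDEF"]
        elif kind == "QOSCFG" and "QFLAGS" in attrs:
            qos_flags[name] = attrs["QFLAGS"]
    return {c: (q, qos_flags.get(q)) for c, q in class_qos.items()}


sample = """\
CLASSCFG[default] QDEF=high
CLASSCFG[batch2]    QDEF=low
QOSCFG[high]       PRIORITY=200000
QOSCFG[high]       QFLAGS=PREEMPTOR
QOSCFG[low]        PRIORITY=10
QOSCFG[low]        QFLAGS=PREEMPTEE
"""

print(class_preemption_roles(sample))
# {'default': ('high', 'PREEMPTOR'), 'batch2': ('low', 'PREEMPTEE')}
```

This matches the checkjob output for job 3340, which shows class:batch2,
qos:low and the PREEMPTEE flag.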

-------------------------------------------------------------------

I searched the Maui archives and found posts about a patch for a
similar problem, but that patch is already incorporated in the
version I am using:
http://www.supercluster.org/pipermail/mauiusers/2004-July/001284.html

-------------------------------------------------------------------

Please let me know if there are any explanations or suggestions.

Thanks.

Edgar


Edgar Leon                                PHONE: (404) 727-2867
Department of Math & Computer Science     FAX:   (404) 727-5611
400 Dowman Drive, Suite W401              EMAIL: edgar at mathcs.emory.edu
Emory University
Atlanta, GA 30322
