[Mauiusers] jobs taking much time for starting

Arnau Bria arnaubria at pic.es
Fri Sep 19 05:23:38 MDT 2008


Hi all,

we have several jobs that have some free resources where they could
start, but they don't: 

i.e:

# checkjob 528813

checking job 528813

State: Idle
Creds:  user:iatprd003  group:iatprd  class:ifae  qos:ilhcatlas
WallTime: 00:00:00 of 3:00:00:00
SubmitTime: Fri Sep 19 10:40:44
  (Time Queued  Total: 2:35:09  Eligible: 00:54:46)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [ifae]


IWD: [NONE]  Executable:  [NONE]
Bypass: 76  StartCount: 0
PartitionMask: [ALL]
PE:  1.00  StartPriority:  86
job can run in partition DEFAULT (2 procs available.  1 procs required)

So, maui says it has available procs, but job does not start. It's the
first Idle job, and we think that this is blocking the entire queue:


IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

528813             iatprd003       Idle     1  3:00:00:00  Fri Sep 19 10:40:44
528814             iatprd003       Idle     1  3:00:00:00  Fri Sep 19 10:40:46
528815             iatprd003       Idle     1  3:00:00:00  Fri Sep 19 10:40:55
528816             iatprd003       Idle     1  3:00:00:00  Fri Sep 19 10:40:56
528817             iatprd003       Idle     1  3:00:00:00  Fri Sep 19 10:41:04
528818             iatprd003       Idle     1  3:00:00:00  Fri Sep 19 10:41:04
529071             iatprd003       Idle     1  3:00:00:00  Fri Sep 19 11:13:58
529072             iatprd003       Idle     1  3:00:00:00  Fri Sep 19 11:14:09
529073             iatprd003       Idle     1  3:00:00:00  Fri Sep 19 11:14:09
529074             iatprd003       Idle     1  3:00:00:00  Fri Sep 19 11:14:09
529076             iatprd003       Idle     1  3:00:00:00  Fri Sep 19 11:14:15
529077             iatprd003       Idle     1  3:00:00:00  Fri Sep 19 11:14:15
529078             iatprd003       Idle     1  3:00:00:00  Fri Sep 19 11:14:16
529080             iatprd003       Idle     1  3:00:00:00  Fri Sep 19 11:14:21
529081             iatprd003       Idle     1  3:00:00:00  Fri Sep 19 11:14:22
529082             iatprd003       Idle     1  3:00:00:00  Fri Sep 19 11:14:22
529083             iatprd003       Idle     1  3:00:00:00  Fri Sep 19 11:14:26
529084             iatprd003       Idle     1  3:00:00:00  Fri Sep 19 11:14:27
529085             iatprd003       Idle     1  3:00:00:00  Fri Sep 19 11:14:27
529088             iatprd003       Idle     1  3:00:00:00  Fri Sep 19 11:14:31
529089             iatprd003       Idle     1  3:00:00:00  Fri Sep 19 11:14:33
529090             iatprd003       Idle     1  3:00:00:00  Fri Sep 19 11:14:33
529091             iatprd003       Idle     1  3:00:00:00  Fri Sep 19 11:14:36
529092             iatprd003       Idle     1  3:00:00:00  Fri Sep 19 11:14:38
529093             iatprd003       Idle     1  3:00:00:00  Fri Sep 19 11:14:38
529094             iatprd003       Idle     1  3:00:00:00  Fri Sep 19 11:14:40
529858             atlas179       Idle     1  3:00:00:00  Fri Sep 19 13:14:56
529793             atprd020       Idle     1  3:00:00:00  Fri Sep 19 13:05:29
529794             atprd020       Idle     1  3:00:00:00  Fri Sep 19 13:05:30
529795             atprd020       Idle     1  3:00:00:00  Fri Sep 19 13:05:31
529796             atprd020       Idle     1  3:00:00:00  Fri Sep 19 13:05:32


If we check next job from diff user:

[root at pbs02 sbin]# checkjob 529858


checking job 529858

State: Idle
Creds:  user:atlas179  group:atlas  class:glong  qos:lhcatlas
WallTime: 00:00:00 of 3:00:00:00
SubmitTime: Fri Sep 19 13:14:56
  (Time Queued  Total: 00:02:58  Eligible: 00:02:58)

StartDate: -00:02:01  Fri Sep 19 13:15:53
Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [slc4]


IWD: [NONE]  Executable:  [NONE]
Bypass: 4  StartCount: 0
PartitionMask: [ALL]
PE:  1.00  StartPriority:  -12
job can run in partition DEFAULT (283 procs available.  1 procs required)


so, it has 283 procs avaliable but it doesn't start....


Those are maui logs refering to the first job:

]# grep 528813 /var/log/maui.log.1
09/19 10:42:16 INFO:     job '528813' loaded:   1 iatprd003   iatprd 259200       Idle   0 1221813644   [NONE] [NONE] [NONE] >=      0 >=      0 [ifae] 1221813736
processing node request line '1'
09/19 10:44:18 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 10:46:22 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 10:48:25 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 10:50:28 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 10:50:39 INFO:     job '528813' loaded:   1 iatprd003   iatprd 259200       Idle   0 1221813644   [NONE] [NONE] [NONE] >=      0 >=      0 [ifae] 1221814238
processing node request line '1'
09/19 10:50:40 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 10:53:03 INFO:     job '528813' loaded:   1 iatprd003   iatprd 259200       Idle   0 1221813644   [NONE] [NONE] [NONE] >=      0 >=      0 [ifae] 1221814383
processing node request line '1'
09/19 10:53:04 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 10:54:50 INFO:     job '528813' loaded:   1 iatprd003   iatprd 259200       Idle   0 1221813644   [NONE] [NONE] [NONE] >=      0 >=      0 [ifae] 1221814489
processing node request line '1'
09/19 10:57:48 INFO:     job '528813' loaded:   1 iatprd003   iatprd 259200       Idle   0 1221813644   [NONE] [NONE] [NONE] >=      0 >=      0 [ifae] 1221814667
processing node request line '1'
09/19 11:00:05 INFO:     job '528813' loaded:   1 iatprd003   iatprd 259200       Idle   0 1221813644   [NONE] [NONE] [NONE] >=      0 >=      0 [ifae] 1221814804
processing node request line '1'
09/19 11:16:26 INFO:     job '528813' loaded:   1 iatprd003   iatprd 259200       Idle   0 1221813644   [NONE] [NONE] [NONE] >=      0 >=      0 [ifae] 1221815785
processing node request line '1'
09/19 12:13:17 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:15:18 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:17:19 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:19:20 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:21:21 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:23:22 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:25:23 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:27:24 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:29:25 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:31:26 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:33:27 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:35:28 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:37:50 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:46:23 INFO:     job '528813' loaded:   1 iatprd003   iatprd 259200       Idle   0 1221813644   [NONE] [NONE] [NONE] >=      0 >=      0 [ifae] 1221821182
processing node request line '1'
09/19 12:46:24 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:48:32 INFO:     job '528813' loaded:   1 iatprd003   iatprd 259200       Idle   0 1221813644   [NONE] [NONE] [NONE] >=      0 >=      0 [ifae] 1221821312
processing node request line '1'
09/19 12:48:47 INFO:     job '528813' loaded:   1 iatprd003   iatprd 259200       Idle   0 1221813644   [NONE] [NONE] [NONE] >=      0 >=      0 [ifae] 1221821327
processing node request line '1'
09/19 12:48:48 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:50:58 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:53:01 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:55:04 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:57:07 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:57:59 INFO:     job '528813' loaded:   1 iatprd003   iatprd 259200       Idle   0 1221813644   [NONE] [NONE] [NONE] >=      0 >=      0 [ifae] 1221821878
processing node request line '1'
09/19 12:57:59 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:58:07 INFO:     job '528813' loaded:   1 iatprd003   iatprd 259200       Idle   0 1221813644   [NONE] [NONE] [NONE] >=      0 >=      0 [ifae] 1221821886
processing node request line '1'
09/19 12:58:07 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)

# grep " cannot set job" /var/log/maui.log|grep 529813.pbs02.pic.es
09/19 13:19:55 WARNING:  cannot set job '529813.pbs02.pic.es' attr 'Resource_List:neednodes' to '1' (rc: 15001 'Unknown Job Id')

Finally, I'm able to force its start:

]# runjob 528813


job '528813' started on 1 proc



So, could some one give us some hint on why is maui not scheduling a
job when there are avalible resources?¿


TIA,
Arnau


More information about the mauiusers mailing list