[Mauiusers] jobs taking much time for starting
Arnau Bria
arnaubria at pic.es
Fri Sep 19 05:23:38 MDT 2008
Hi all,
we have several jobs that have some free resources where they could
start, but they don't:
i.e:
# checkjob 528813
checking job 528813
State: Idle
Creds: user:iatprd003 group:iatprd class:ifae qos:ilhcatlas
WallTime: 00:00:00 of 3:00:00:00
SubmitTime: Fri Sep 19 10:40:44
(Time Queued Total: 2:35:09 Eligible: 00:54:46)
Total Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [ifae]
IWD: [NONE] Executable: [NONE]
Bypass: 76 StartCount: 0
PartitionMask: [ALL]
PE: 1.00 StartPriority: 86
job can run in partition DEFAULT (2 procs available. 1 procs required)
So, maui says it has available procs, but job does not start. It's the
first Idle job, and we think that this is blocking the entire queue:
IDLE JOBS----------------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
528813 iatprd003 Idle 1 3:00:00:00 Fri Sep 19 10:40:44
528814 iatprd003 Idle 1 3:00:00:00 Fri Sep 19 10:40:46
528815 iatprd003 Idle 1 3:00:00:00 Fri Sep 19 10:40:55
528816 iatprd003 Idle 1 3:00:00:00 Fri Sep 19 10:40:56
528817 iatprd003 Idle 1 3:00:00:00 Fri Sep 19 10:41:04
528818 iatprd003 Idle 1 3:00:00:00 Fri Sep 19 10:41:04
529071 iatprd003 Idle 1 3:00:00:00 Fri Sep 19 11:13:58
529072 iatprd003 Idle 1 3:00:00:00 Fri Sep 19 11:14:09
529073 iatprd003 Idle 1 3:00:00:00 Fri Sep 19 11:14:09
529074 iatprd003 Idle 1 3:00:00:00 Fri Sep 19 11:14:09
529076 iatprd003 Idle 1 3:00:00:00 Fri Sep 19 11:14:15
529077 iatprd003 Idle 1 3:00:00:00 Fri Sep 19 11:14:15
529078 iatprd003 Idle 1 3:00:00:00 Fri Sep 19 11:14:16
529080 iatprd003 Idle 1 3:00:00:00 Fri Sep 19 11:14:21
529081 iatprd003 Idle 1 3:00:00:00 Fri Sep 19 11:14:22
529082 iatprd003 Idle 1 3:00:00:00 Fri Sep 19 11:14:22
529083 iatprd003 Idle 1 3:00:00:00 Fri Sep 19 11:14:26
529084 iatprd003 Idle 1 3:00:00:00 Fri Sep 19 11:14:27
529085 iatprd003 Idle 1 3:00:00:00 Fri Sep 19 11:14:27
529088 iatprd003 Idle 1 3:00:00:00 Fri Sep 19 11:14:31
529089 iatprd003 Idle 1 3:00:00:00 Fri Sep 19 11:14:33
529090 iatprd003 Idle 1 3:00:00:00 Fri Sep 19 11:14:33
529091 iatprd003 Idle 1 3:00:00:00 Fri Sep 19 11:14:36
529092 iatprd003 Idle 1 3:00:00:00 Fri Sep 19 11:14:38
529093 iatprd003 Idle 1 3:00:00:00 Fri Sep 19 11:14:38
529094 iatprd003 Idle 1 3:00:00:00 Fri Sep 19 11:14:40
529858 atlas179 Idle 1 3:00:00:00 Fri Sep 19 13:14:56
529793 atprd020 Idle 1 3:00:00:00 Fri Sep 19 13:05:29
529794 atprd020 Idle 1 3:00:00:00 Fri Sep 19 13:05:30
529795 atprd020 Idle 1 3:00:00:00 Fri Sep 19 13:05:31
529796 atprd020 Idle 1 3:00:00:00 Fri Sep 19 13:05:32
If we check next job from diff user:
[root at pbs02 sbin]# checkjob 529858
checking job 529858
State: Idle
Creds: user:atlas179 group:atlas class:glong qos:lhcatlas
WallTime: 00:00:00 of 3:00:00:00
SubmitTime: Fri Sep 19 13:14:56
(Time Queued Total: 00:02:58 Eligible: 00:02:58)
StartDate: -00:02:01 Fri Sep 19 13:15:53
Total Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [slc4]
IWD: [NONE] Executable: [NONE]
Bypass: 4 StartCount: 0
PartitionMask: [ALL]
PE: 1.00 StartPriority: -12
job can run in partition DEFAULT (283 procs available. 1 procs required)
so, it has 283 procs avaliable but it doesn't start....
Those are maui logs refering to the first job:
]# grep 528813 /var/log/maui.log.1
09/19 10:42:16 INFO: job '528813' loaded: 1 iatprd003 iatprd 259200 Idle 0 1221813644 [NONE] [NONE] [NONE] >= 0 >= 0 [ifae] 1221813736
processing node request line '1'
09/19 10:44:18 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 10:46:22 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 10:48:25 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 10:50:28 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 10:50:39 INFO: job '528813' loaded: 1 iatprd003 iatprd 259200 Idle 0 1221813644 [NONE] [NONE] [NONE] >= 0 >= 0 [ifae] 1221814238
processing node request line '1'
09/19 10:50:40 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 10:53:03 INFO: job '528813' loaded: 1 iatprd003 iatprd 259200 Idle 0 1221813644 [NONE] [NONE] [NONE] >= 0 >= 0 [ifae] 1221814383
processing node request line '1'
09/19 10:53:04 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 10:54:50 INFO: job '528813' loaded: 1 iatprd003 iatprd 259200 Idle 0 1221813644 [NONE] [NONE] [NONE] >= 0 >= 0 [ifae] 1221814489
processing node request line '1'
09/19 10:57:48 INFO: job '528813' loaded: 1 iatprd003 iatprd 259200 Idle 0 1221813644 [NONE] [NONE] [NONE] >= 0 >= 0 [ifae] 1221814667
processing node request line '1'
09/19 11:00:05 INFO: job '528813' loaded: 1 iatprd003 iatprd 259200 Idle 0 1221813644 [NONE] [NONE] [NONE] >= 0 >= 0 [ifae] 1221814804
processing node request line '1'
09/19 11:16:26 INFO: job '528813' loaded: 1 iatprd003 iatprd 259200 Idle 0 1221813644 [NONE] [NONE] [NONE] >= 0 >= 0 [ifae] 1221815785
processing node request line '1'
09/19 12:13:17 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:15:18 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:17:19 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:19:20 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:21:21 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:23:22 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:25:23 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:27:24 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:29:25 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:31:26 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:33:27 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:35:28 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:37:50 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:46:23 INFO: job '528813' loaded: 1 iatprd003 iatprd 259200 Idle 0 1221813644 [NONE] [NONE] [NONE] >= 0 >= 0 [ifae] 1221821182
processing node request line '1'
09/19 12:46:24 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:48:32 INFO: job '528813' loaded: 1 iatprd003 iatprd 259200 Idle 0 1221813644 [NONE] [NONE] [NONE] >= 0 >= 0 [ifae] 1221821312
processing node request line '1'
09/19 12:48:47 INFO: job '528813' loaded: 1 iatprd003 iatprd 259200 Idle 0 1221813644 [NONE] [NONE] [NONE] >= 0 >= 0 [ifae] 1221821327
processing node request line '1'
09/19 12:48:48 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:50:58 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:53:01 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:55:04 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:57:07 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:57:59 INFO: job '528813' loaded: 1 iatprd003 iatprd 259200 Idle 0 1221813644 [NONE] [NONE] [NONE] >= 0 >= 0 [ifae] 1221821878
processing node request line '1'
09/19 12:57:59 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:58:07 INFO: job '528813' loaded: 1 iatprd003 iatprd 259200 Idle 0 1221813644 [NONE] [NONE] [NONE] >= 0 >= 0 [ifae] 1221821886
processing node request line '1'
09/19 12:58:07 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
# grep " cannot set job" /var/log/maui.log|grep 529813.pbs02.pic.es
09/19 13:19:55 WARNING: cannot set job '529813.pbs02.pic.es' attr 'Resource_List:neednodes' to '1' (rc: 15001 'Unknown Job Id')
Finally, I'm able to force its start:
]# runjob 528813
job '528813' started on 1 proc
So, could some one give us some hint on why is maui not scheduling a
job when there are avalible resources?¿
TIA,
Arnau
More information about the mauiusers
mailing list