[Mauiusers] jobs not starting when resources are available
Arnau Bria
arnaubria at pic.es
Tue Oct 7 01:41:38 MDT 2008
Hi all,
Some jobs stay at the top of the idle queue and keep the rest from
starting (jobs from other queues that have nothing to do with these ones).
Looking at them, I see they have the resources they need to start running,
but they don't start:
[root@pbs02 ~]# checkjob -v 672949
checking job 672949 (RM job '672949.pbs02.pic.es')
State: Idle
Creds: user:iatprd045 group:iatprd class:ifae qos:ilhcatlas
WallTime: 00:00:00 of 3:00:00:00
SubmitTime: Tue Oct 7 06:35:52
(Time Queued Total: 3:02:20 Eligible: 1:20:42)
Total Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [ifae]
Exec: '' ExecSize: 0 ImageSize: 0
Dedicated Resources Per Task: PROCS: 1
NodeAccess: SHARED
NodeCount: 0
IWD: [NONE] Executable: [NONE]
Bypass: 12 StartCount: 0
PartitionMask: [ALL]
SystemQueueTime: Tue Oct 7 08:17:30
PE: 1.00 StartPriority: 82
job can run in partition DEFAULT (17 procs available. 1 procs required)
]# diagnose -j 672949
Name State Par Proc QOS WCLimit R Min User Group Account QueuedTime Network Opsys Arch Mem Disk Procs Class Features
672949 Idle ALL 1 ilh 3:00:00:00 0 1 iatprd04 iatprd - 1:22:43 [NONE] [NONE] [NONE] >=0 >=0 NC0 [ifae:1] [ifae]
There are some nodes where they could start:
td204.pic.es
state = free
np = 4
properties = ifae
--
td203.pic.es
state = free
np = 4
properties = ifae
# checknode td204.pic.es
checking node td204.pic.es
State: Running (in current state for 00:00:00)
Expected State: Idle SyncDeadline: Sat Oct 24 14:26:40
Configured Resources: PROCS: 4 MEM: 8115M SWAP: 8115M DISK: 15G
Utilized Resources: DISK: 4752M
Dedicated Resources: PROCS: 3
Opsys: linux Arch: [NONE]
Speed: 1.00 Load: 3.000
Network: [DEFAULT]
Features: [ifae]
Attributes: [Batch]
Classes: [long 4:4][medium 4:4][short 4:4][ifae 1:4][gshort 4:4][glong 4:4][gmedium 4:4][lhcbsl4 4:4][magic 4:4][roman 4:4]
Total Time: 58:11:34:08 Up: 58:10:24:24 (99.92%) Active: 41:19:36:22 (71.50%)
Reservations:
Job '672291'(x1) -6:17:17 -> 2:17:42:43 (3:00:00:00)
Job '672297'(x1) -6:15:47 -> 2:17:44:13 (3:00:00:00)
Job '672924'(x1) -3:05:22 -> 2:20:54:38 (3:00:00:00)
JobList: 672291,672297,672924
]# diagnose -n td204.pic.es
diagnosing node table (5120 slots)
Name State Procs Memory Disk Swap Speed Opsys Arch Par Load Res Classes Network Features
td204.pic.es Running 1:4 8115:8115 10635:15387 8115:8115 1.00 linux [NONE] DEF 3.00 003 [long_4:4][medium_4:4][short_4 [DEFAULT] [ifae]
----- --- 1:4 8115:8115 10635:15387 8115:8115
Total Nodes: 1 (Active: 1 Idle: 0 Down: 0)
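As a cross-check on the node output above, here is a minimal Python sketch that works out how many processors are still free on td204.pic.es by subtracting the dedicated PROCS from the configured PROCS. The helper names (`procs`, `free_procs`) are hypothetical and only for illustration, and the parsing assumes the checknode line layout is exactly as pasted above; it is not a Maui tool.

```python
import re

# Excerpt of the `checknode td204.pic.es` output pasted above.
CHECKNODE_OUTPUT = """\
Configured Resources: PROCS: 4 MEM: 8115M SWAP: 8115M DISK: 15G
Utilized Resources: DISK: 4752M
Dedicated Resources: PROCS: 3
"""


def procs(label, text):
    """Return the PROCS value on the line starting with `label`, or 0 if none.

    Hypothetical helper: assumes the 'Label: PROCS: N ...' layout shown above.
    """
    pattern = rf"^{re.escape(label)}:.*?PROCS:\s*(\d+)"
    match = re.search(pattern, text, re.MULTILINE)
    return int(match.group(1)) if match else 0


def free_procs(text):
    """Configured PROCS minus dedicated PROCS for one node."""
    return procs("Configured Resources", text) - procs("Dedicated Resources", text)


print(free_procs(CHECKNODE_OUTPUT))
```

With 4 configured and 3 dedicated processors this leaves 1 free, which agrees with the `1:4` in the Procs column of `diagnose -n` and with the single processor the idle job asks for.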
If I force them (runjob) they start, but meanwhile I have a long
queue with many jobs in other queues that could also start.
Where should I start looking for the source of this problem?
Cheers,
Arnau