[Mauiusers] jobs not starting when resources are available
Tom Rudwick
tomr at intrinsity.com
Tue Oct 7 11:52:38 MDT 2008
We were having a lot of problems with that until we increased our
RESERVATIONDEPTH. We set ours to 100 and it seems to have helped.
From the documentation, it looks as if it should not make a
difference, but in our case it seems to.
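
A minimal sketch of what that change might look like in maui.cfg. Only
the RESERVATIONDEPTH value of 100 comes from this message; the
RESERVATIONPOLICY line is an assumption added for context, and the right
depth for another site may well differ.

```
# maui.cfg (sketch; values are site-specific)

# Assumption: priority reservations go to the highest-priority idle jobs.
# This line is not from the original message.
RESERVATIONPOLICY  CURRENTHIGHEST

# Number of priority reservations Maui may hold at once.
# 100 is the value reported above as having helped.
RESERVATIONDEPTH   100
```

Maui needs to re-read its configuration (for example, by restarting the
scheduler) before the new depth takes effect.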
Tom
Arnau Bria wrote:
> Hi all,
>
> Some jobs stay at the top of the IDLE queue and don't let the rest start
> (jobs from other queues that have nothing to do with these ones).
>
> Looking at them, I see they have the resources to start running, but they
> don't:
>
>
> [root@pbs02 ~]# checkjob -v 672949
>
>
> checking job 672949 (RM job '672949.pbs02.pic.es')
>
> State: Idle
> Creds: user:iatprd045 group:iatprd class:ifae qos:ilhcatlas
> WallTime: 00:00:00 of 3:00:00:00
> SubmitTime: Tue Oct 7 06:35:52
> (Time Queued Total: 3:02:20 Eligible: 1:20:42)
>
> Total Tasks: 1
>
> Req[0] TaskCount: 1 Partition: ALL
> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> Opsys: [NONE] Arch: [NONE] Features: [ifae]
> Exec: '' ExecSize: 0 ImageSize: 0
> Dedicated Resources Per Task: PROCS: 1
> NodeAccess: SHARED
> NodeCount: 0
>
>
> IWD: [NONE] Executable: [NONE]
> Bypass: 12 StartCount: 0
> PartitionMask: [ALL]
> SystemQueueTime: Tue Oct 7 08:17:30
>
> PE: 1.00 StartPriority: 82
> job can run in partition DEFAULT (17 procs available. 1 procs required)
>
>
> ]# diagnose -j 672949
> Name State Par Proc QOS WCLimit R Min User Group Account QueuedTime Network Opsys Arch Mem Disk Procs Class Features
>
> 672949 Idle ALL 1 ilh 3:00:00:00 0 1 iatprd04 iatprd - 1:22:43 [NONE] [NONE] [NONE] >=0 >=0 NC0 [ifae:1] [ifae]
>
>
> There are some nodes where they could start:
>
> td204.pic.es
> state = free
> np = 4
> properties = ifae
> --
>
> td203.pic.es
> state = free
> np = 4
> properties = ifae
>
>
> # checknode td204.pic.es
>
>
> checking node td204.pic.es
>
> State: Running (in current state for 00:00:00)
> Expected State: Idle SyncDeadline: Sat Oct 24 14:26:40
> Configured Resources: PROCS: 4 MEM: 8115M SWAP: 8115M DISK: 15G
> Utilized Resources: DISK: 4752M
> Dedicated Resources: PROCS: 3
> Opsys: linux Arch: [NONE]
> Speed: 1.00 Load: 3.000
> Network: [DEFAULT]
> Features: [ifae]
> Attributes: [Batch]
> Classes: [long 4:4][medium 4:4][short 4:4][ifae 1:4][gshort 4:4][glong 4:4][gmedium 4:4][lhcbsl4 4:4][magic 4:4][roman 4:4]
>
> Total Time: 58:11:34:08 Up: 58:10:24:24 (99.92%) Active: 41:19:36:22 (71.50%)
>
> Reservations:
> Job '672291'(x1) -6:17:17 -> 2:17:42:43 (3:00:00:00)
> Job '672297'(x1) -6:15:47 -> 2:17:44:13 (3:00:00:00)
> Job '672924'(x1) -3:05:22 -> 2:20:54:38 (3:00:00:00)
> JobList: 672291,672297,672924
>
>
> ]# diagnose -n td204.pic.es
> diagnosing node table (5120 slots)
> Name State Procs Memory Disk Swap Speed Opsys Arch Par Load Res Classes Network Features
>
> td204.pic.es Running 1:4 8115:8115 10635:15387 8115:8115 1.00 linux [NONE] DEF 3.00 003 [long_4:4][medium_4:4][short_4 [DEFAULT] [ifae]
> ----- --- 1:4 8115:8115 10635:15387 8115:8115
>
> Total Nodes: 1 (Active: 1 Idle: 0 Down: 0)
>
>
>
>
> If I force them (runjob) they start, but meanwhile I have a long
> queue with many jobs in other queues that could also start.
>
> Where may I start looking for the source of this problem?
>
>
> Cheers,
> Arnau