[Mauiusers] Backfill Jobs prevent PREEMPTOR jobs from running?
Edsall, William (WJ)
WJEdsall at dow.com
Thu Feb 16 10:19:49 MST 2012
Hello,
The diagnose -p output was truncated. I was hoping to confirm that jobs
33-35 (queued) did not have a large QTime, which could be raising their
priority above that of your job 38. That could make job 38 wait even
though they are not running. It sounds doubtful in your scenario, but
I've seen it cause issues before.
If you delete the queued (Q state) jobs 33-35, does your job 38 start?
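One way to check the priority concern above without reading the table by
eye is to parse the diagnose -p text and list any job that outranks the
PREEMPTOR job. The following is a minimal Python sketch, not part of
Maui: the function names parse_priorities and jobs_outranking are
hypothetical, the sample text is the diagnose -p output quoted further
down in this thread, and the parser assumes every job row starts with a
numeric job ID followed by a non-negative integer priority.

```python
# Sample copied from the `diagnose -p` output quoted later in this thread.
SAMPLE = """\
Job PRIORITY* Cred( QOS) Serv(QTime)
Weights -------- 100( 1000) 1( 1)
38 100000109 100.0(1000.) 0.0(109.4)
2 5 0.0( 0.0) 100.0( 5.3)
3 5 0.0( 0.0) 100.0( 5.3)
4 5 0.0( 0.0) 100.0( 5.3)
5 5 0.0( 0.0) 100.0( 5.3)
6 5 0.0( 0.0) 100.0( 5.3)
7 5 0.0( 0.0) 100.0( 5.3)
Percent Contribution -------- 100.0(100.0) 0.0( 0.0)
"""


def parse_priorities(text):
    """Return {job_id: priority} for every job row in `diagnose -p` text.

    Assumption: a job row starts with a numeric job id followed by a
    non-negative integer priority; header and summary rows do not.
    """
    priorities = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].isdigit() and fields[1].isdigit():
            priorities[fields[0]] = int(fields[1])
    return priorities


def jobs_outranking(priorities, target):
    """List the job ids whose priority is strictly higher than `target`'s."""
    limit = priorities[target]
    higher = (job for job, prio in priorities.items() if prio > limit)
    return sorted(higher, key=int)


if __name__ == "__main__":
    prio = parse_priorities(SAMPLE)
    print("job 38 priority:", prio["38"])
    print("jobs outranking 38:", jobs_outranking(prio, "38"))
```

On this sample no job outranks job 38. Jobs 33-35 do not appear in the
quoted output, so they cannot be checked this way until the full
diagnose -p listing is available.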
We use the same preemption concept you're trying to achieve, but I'm
having a hard time narrowing down the cause of your error. Two small
differences in our configuration are the backfill policy and the
reservation policy. You might try these settings and then restart Maui:
BACKFILLPOLICY BESTFIT
RESERVATIONPOLICY CURRENTHIGHEST
From: Joseph Farran [mailto:jfarran at uci.edu]
Sent: Thursday, February 16, 2012 11:48 AM
To: Edsall, William (WJ); mauiusers at supercluster.org
Subject: Re: [Mauiusers] Backfill Jobs prevent PREEMPTOR jobs from
running?
Hi Edsall.
Thank you for responding. I have a few more nodes now, but the same
configuration. I am including the diagnose -p with other details:
We have 13 64-core nodes. All nodes have the 'free' feature and a queue
named 'free' as PREEMPTEE so that we can harvest idle cycles when the
nodes are not in use by their owners.
As user "juser", I load up the 'free' queue (PREEMPTEE) as follows:
1.hpc.cluster. juser free test  24904 1  63 -- 72:00 R 00:01
2.hpc.cluster. juser free test  29346 1  63 -- 72:00 R 00:01
3.hpc.cluster. juser free test  42900 1  63 -- 72:00 R 00:01
4.hpc.cluster. juser free test  30291 1  63 -- 72:00 R 00:01
5.hpc.cluster. juser free test  26417 1  63 -- 72:00 R 00:01
6.hpc.cluster. juser free test  40206 1  63 -- 72:00 R 00:01
7.hpc.cluster. juser free test   1786 1  63 -- 72:00 R 00:01
8.hpc.cluster. juser free test  62436 1  63 -- 72:00 R 00:01
9.hpc.cluster. juser free test  49087 1  63 -- 72:00 R 00:01
10.hpc.cluster juser free test  45691 1  63 -- 72:00 R 00:01
11.hpc.cluster juser free test  41386 1  63 -- 72:00 R 00:01
12.hpc.cluster juser free test  35204 1  63 -- 72:00 R 00:01
13.hpc.cluster juser free test  51043 1  63 -- 72:00 R 00:01
14.hpc.cluster juser free test  24948 1   1 -- 72:00 R 00:01
15.hpc.cluster juser free test  29390 1   1 -- 72:00 R 00:01
16.hpc.cluster juser free test  42944 1   1 -- 72:00 R 00:01
17.hpc.cluster juser free test  30335 1   1 -- 72:00 R 00:01
18.hpc.cluster juser free test  26461 1   1 -- 72:00 R 00:01
19.hpc.cluster juser free test  40250 1   1 -- 72:00 R 00:01
20.hpc.cluster juser free test   1830 1   1 -- 72:00 R 00:01
21.hpc.cluster juser free test  62480 1   1 -- 72:00 R 00:01
22.hpc.cluster juser free test  49131 1   1 -- 72:00 R 00:01
23.hpc.cluster juser free test  45735 1   1 -- 72:00 R 00:01
24.hpc.cluster juser free test  41430 1   1 -- 72:00 R 00:01
25.hpc.cluster juser free test  35248 1   1 -- 72:00 R 00:01
26.hpc.cluster juser free test  51087 1   1 -- 72:00 R 00:01
27.hpc.cluster juser free test     -- 1   1 -- 72:00 Q    --
28.hpc.cluster juser free test     -- 1   1 -- 72:00 Q    --
29.hpc.cluster juser free test     -- 1   1 -- 72:00 Q    --
30.hpc.cluster juser free test     -- 1   1 -- 72:00 Q    --
31.hpc.cluster juser free test     -- 1   1 -- 72:00 Q    --
32.hpc.cluster juser free test     -- 1   1 -- 72:00 Q    --
33.hpc.cluster juser free test     -- 1   1 -- 72:00 Q    --
34.hpc.cluster juser free test     -- 1   1 -- 72:00 Q    --
35.hpc.cluster juser free test     -- 1   1 -- 72:00 Q    --
As user "tw", who owns the 'tw' nodes, I run:
qsub -I -q tw -l nodes=6:ppn=64
And preemption works as expected:
1.hpc.cluster. juser free test  24904 1  63 -- 72:00 R 00:02
2.hpc.cluster. juser free test  29346 1  63 -- 72:00 S 00:01
3.hpc.cluster. juser free test  42900 1  63 -- 72:00 S 00:01
4.hpc.cluster. juser free test  30291 1  63 -- 72:00 S 00:01
5.hpc.cluster. juser free test  26417 1  63 -- 72:00 S 00:01
6.hpc.cluster. juser free test  40206 1  63 -- 72:00 S 00:01
7.hpc.cluster. juser free test   1786 1  63 -- 72:00 S 00:01
8.hpc.cluster. juser free test  62436 1  63 -- 72:00 R 00:01
9.hpc.cluster. juser free test  49087 1  63 -- 72:00 R 00:02
10.hpc.cluster juser free test  45691 1  63 -- 72:00 R 00:02
11.hpc.cluster juser free test  41386 1  63 -- 72:00 R 00:02
12.hpc.cluster juser free test  35204 1  63 -- 72:00 R 00:02
13.hpc.cluster juser free test  51043 1  63 -- 72:00 R 00:02
14.hpc.cluster juser free test  24948 1   1 -- 72:00 R 00:02
15.hpc.cluster juser free test  29390 1   1 -- 72:00 S 00:02
16.hpc.cluster juser free test  42944 1   1 -- 72:00 S 00:01
17.hpc.cluster juser free test  30335 1   1 -- 72:00 S 00:01
18.hpc.cluster juser free test  26461 1   1 -- 72:00 S 00:01
19.hpc.cluster juser free test  40250 1   1 -- 72:00 S 00:01
20.hpc.cluster juser free test   1830 1   1 -- 72:00 S 00:01
21.hpc.cluster juser free test  62480 1   1 -- 72:00 R 00:02
22.hpc.cluster juser free test  49131 1   1 -- 72:00 R 00:02
23.hpc.cluster juser free test  45735 1   1 -- 72:00 R 00:02
24.hpc.cluster juser free test  41430 1   1 -- 72:00 R 00:02
25.hpc.cluster juser free test  35248 1   1 -- 72:00 R 00:02
26.hpc.cluster juser free test  51087 1   1 -- 72:00 R 00:02
27.hpc.cluster juser free test     -- 1   1 -- 72:00 Q    --
28.hpc.cluster juser free test     -- 1   1 -- 72:00 Q    --
29.hpc.cluster juser free test     -- 1   1 -- 72:00 Q    --
30.hpc.cluster juser free test     -- 1   1 -- 72:00 Q    --
31.hpc.cluster juser free test     -- 1   1 -- 72:00 Q    --
32.hpc.cluster juser free test     -- 1   1 -- 72:00 Q    --
33.hpc.cluster juser free test     -- 1   1 -- 72:00 Q    --
34.hpc.cluster juser free test     -- 1   1 -- 72:00 Q    --
35.hpc.cluster juser free test     -- 1   1 -- 72:00 Q    --
36.hpc.cluster tw    tw   STDIN 30505 6 384 -- 99:00 R    --
As user 'tw', I exit and run the command:
qsub -I -q tw -l nodes=6:ppn=62
Everything works again as expected, and Maui also starts 6 new 1-core
jobs (jobs 27 through 32):
1.hpc.cluster. juser free test  24904 1  63 -- 72:00 R 00:03
2.hpc.cluster. juser free test  29346 1  63 -- 72:00 S 00:01
3.hpc.cluster. juser free test  42900 1  63 -- 72:00 S 00:01
4.hpc.cluster. juser free test  30291 1  63 -- 72:00 S 00:01
5.hpc.cluster. juser free test  26417 1  63 -- 72:00 S 00:01
6.hpc.cluster. juser free test  40206 1  63 -- 72:00 S 00:02
7.hpc.cluster. juser free test   1786 1  63 -- 72:00 S 00:02
8.hpc.cluster. juser free test  62436 1  63 -- 72:00 R 00:03
9.hpc.cluster. juser free test  49087 1  63 -- 72:00 R 00:03
10.hpc.cluster juser free test  45691 1  63 -- 72:00 R 00:03
11.hpc.cluster juser free test  41386 1  63 -- 72:00 R 00:03
12.hpc.cluster juser free test  35204 1  63 -- 72:00 R 00:03
13.hpc.cluster juser free test  51043 1  63 -- 72:00 R 00:03
14.hpc.cluster juser free test  24948 1   1 -- 72:00 R 00:03
15.hpc.cluster juser free test  29390 1   1 -- 72:00 R 00:02
16.hpc.cluster juser free test  42944 1   1 -- 72:00 R 00:02
17.hpc.cluster juser free test  30335 1   1 -- 72:00 R 00:02
18.hpc.cluster juser free test  26461 1   1 -- 72:00 R 00:02
19.hpc.cluster juser free test  40250 1   1 -- 72:00 R 00:02
20.hpc.cluster juser free test   1830 1   1 -- 72:00 R 00:02
21.hpc.cluster juser free test  62480 1   1 -- 72:00 R 00:03
22.hpc.cluster juser free test  49131 1   1 -- 72:00 R 00:03
23.hpc.cluster juser free test  45735 1   1 -- 72:00 R 00:03
24.hpc.cluster juser free test  41430 1   1 -- 72:00 R 00:03
25.hpc.cluster juser free test  35248 1   1 -- 72:00 R 00:03
26.hpc.cluster juser free test  51087 1   1 -- 72:00 R 00:03
27.hpc.cluster juser free test  30749 1   1 -- 72:00 R    --
28.hpc.cluster juser free test  44220 1   1 -- 72:00 R    --
29.hpc.cluster juser free test  31513 1   1 -- 72:00 R    --
30.hpc.cluster juser free test  27736 1   1 -- 72:00 R    --
31.hpc.cluster juser free test  41429 1   1 -- 72:00 R    --
32.hpc.cluster juser free test   3130 1   1 -- 72:00 R    --
33.hpc.cluster juser free test     -- 1   1 -- 72:00 Q    --
34.hpc.cluster juser free test     -- 1   1 -- 72:00 Q    --
35.hpc.cluster juser free test     -- 1   1 -- 72:00 Q    --
37.hpc.cluster tw    tw   STDIN 30708 6 372 -- 99:00 R    --
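The core counts in the listing above can be checked with a little
arithmetic. The Python sketch below is only an illustration: the helper
free_cores is a hypothetical name, the 64 cores per node come from the
np=64 entries in the nodes file quoted in this message, and it assumes
that the resumed and newly started 1-core jobs all landed on the six
'tw' nodes, which the listing itself does not state.

```python
CORES_PER_NODE = 64   # np=64 in the nodes file quoted in this message
TW_NODES = 6          # nodes requested by the 'tw' owner jobs


def free_cores(ppn, nodes=TW_NODES, cores_per_node=CORES_PER_NODE):
    """Return (free cores per node, free cores in total) while an owner
    job holds `ppn` cores on each of `nodes` nodes."""
    per_node = cores_per_node - ppn
    if per_node < 0:
        raise ValueError("ppn exceeds the cores available on a node")
    return per_node, per_node * nodes


# Job ids read off the listing above.
resumed = list(range(15, 21))   # 1-core jobs that went from S back to R
started = list(range(27, 33))   # 1-core jobs that went from Q to R

if __name__ == "__main__":
    per_node, total = free_cores(ppn=62)
    print("free per node:", per_node, "free in total:", total)
    print("1-core jobs placed:", len(resumed) + len(started))
    # A suspended 63-core job cannot resume in the cores left per node.
    print("63-core job fits:", 63 <= per_node)
```

With ppn=62 there are 2 free cores per node and 12 in total, which
matches the 6 resumed plus 6 newly started 1-core jobs, while the
63-core jobs 2 through 7 stay suspended. With ppn=64 nothing is left
over, so job 38 can only start if those 12 jobs are preempted.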
However, if I now exit and go back and try to get 6 of the 64-core nodes
(which worked before) I cannot. Maui will not preempt the new jobs it
started.
My new job 38 below just sits in the queue:
$ qsub -I -q tw -l nodes=6:ppn=64
qsub: waiting for job 38.hpc.cluster to start
# diagnose -p
diagnosing job priority information (partition: ALL)
Job                    PRIORITY*   Cred(  QOS)  Serv(QTime)
Weights                --------     100( 1000)     1(    1)
38                    100000109   100.0(1000.)   0.0(109.4)
2                             5     0.0(  0.0) 100.0(  5.3)
3                             5     0.0(  0.0) 100.0(  5.3)
4                             5     0.0(  0.0) 100.0(  5.3)
5                             5     0.0(  0.0) 100.0(  5.3)
6                             5     0.0(  0.0) 100.0(  5.3)
7                             5     0.0(  0.0) 100.0(  5.3)
Percent Contribution   --------   100.0(100.0)   0.0(  0.0)
[root@mpc-x maui]# checkjob -v 38
checking job 38 (RM job '38.hpc.cluster')
State: Idle
Creds: user:tw group:tw class:tw qos:high
WallTime: 00:00:00 of 4:03:00:00
SubmitTime: Thu Feb 16 08:26:31
(Time Queued Total: 00:01:37 Eligible: 00:01:37)
Total Tasks: 384
Req[0] TaskCount: 384 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [tw]
Exec: '' ExecSize: 0 ImageSize: 0
Dedicated Resources Per Task: PROCS: 1
NodeAccess: SHARED
TasksPerNode: 64 NodeCount: 6
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 0
PartitionMask: [ALL]
Flags: PREEMPTOR
Reservation '38' (2:23:58:22 -> 7:02:58:22 Duration: 4:03:00:00)
PE: 384.00 StartPriority: 100000163
job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 384 procs found)
idle procs: 384 feasible procs: 0
Rejection Reasons: [Features : 7][CPU : 6]
Detailed Node Availability Information:
compute-1-1 rejected : Features
compute-1-2 rejected : Features
compute-1-3 rejected : Features
compute-1-4 rejected : Features
compute-1-5 rejected : Features
compute-1-6 rejected : Features
compute-1-7 rejected : CPU
compute-1-8 rejected : CPU
compute-1-9 rejected : CPU
compute-1-10 rejected : CPU
compute-1-11 rejected : CPU
compute-1-12 rejected : CPU
compute-1-13 rejected : Features
-------------------------------------------------------
Here is my PBS nodes file:
# cat /opt/torque/server_priv/nodes
compute-1-1 np=64 sf free
compute-1-2 np=64 sf free
compute-1-3 np=64 sf free
compute-1-4 np=64 chem free
compute-1-5 np=64 chem free
compute-1-6 np=64 chem free
compute-1-7 np=64 tw free
compute-1-8 np=64 tw free
compute-1-9 np=64 tw free
compute-1-10 np=64 tw free
compute-1-11 np=64 tw free
compute-1-12 np=64 tw free
compute-1-13 np=64 bio free
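The rejection counts reported by checkjob can be cross-checked against
this nodes file. The following Python sketch is an illustration under
stated assumptions: parse_nodes and summarize are hypothetical helper
names, and the parser handles only the simple "name np=N feature..."
layout shown above, not every attribute a TORQUE nodes file may carry.

```python
# Copied from the nodes file shown above.
NODES_FILE = """\
compute-1-1 np=64 sf free
compute-1-2 np=64 sf free
compute-1-3 np=64 sf free
compute-1-4 np=64 chem free
compute-1-5 np=64 chem free
compute-1-6 np=64 chem free
compute-1-7 np=64 tw free
compute-1-8 np=64 tw free
compute-1-9 np=64 tw free
compute-1-10 np=64 tw free
compute-1-11 np=64 tw free
compute-1-12 np=64 tw free
compute-1-13 np=64 bio free
"""


def parse_nodes(text):
    """Return {node_name: (np, set_of_features)} for each node line."""
    nodes = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comments
        np_count, features = 0, set()
        for field in fields[1:]:
            if field.startswith("np="):
                np_count = int(field[3:])
            else:
                features.add(field)
        nodes[fields[0]] = (np_count, features)
    return nodes


def summarize(nodes, feature):
    """Return (nodes with feature, nodes without it, procs on matching nodes)."""
    matching = [name for name, (_, feats) in nodes.items() if feature in feats]
    procs = sum(nodes[name][0] for name in matching)
    return len(matching), len(nodes) - len(matching), procs


if __name__ == "__main__":
    with_tw, without_tw, procs = summarize(parse_nodes(NODES_FILE), "tw")
    print(with_tw, without_tw, procs)
```

For the 'tw' feature this gives 6 matching nodes, 7 non-matching nodes
and 384 procs, which lines up with "Rejection Reasons: [Features : 7]
[CPU : 6]" and the 384 tasks job 38 asks for. The six 'tw' nodes
therefore satisfy the feature requirement and are rejected only because
their cores are in use.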
------------------------------------
Edsall, William (WJ) wrote:
Hi,
What does diagnose -p say about the priority of the jobs you expect to
be preempted? Priority may take precedence over preemptability.
From: mauiusers-bounces at supercluster.org
[mailto:mauiusers-bounces at supercluster.org] On Behalf Of Joseph Farran
Sent: Monday, February 13, 2012 3:19 PM
To: mauiusers at supercluster.org
Subject: [Mauiusers] Backfill Jobs prevent PREEMPTOR jobs from running?