[Mauiusers] maui not scheduling when no resources available
Arnau Bria
arnaubria at pic.es
Thu Dec 13 08:05:06 MST 2007
On Thu, 13 Dec 2007 13:59:59 +0100
Jan Ploski wrote:
> mauiusers-bounces at supercluster.org schrieb am 12/13/2007 01:05:06 PM:
>
> ...
> > Something similar happened when requesting hosts with "slc3 &&
> > slc4": no nodes fit that condition and Maui hung.
> >
> > So, is it a bug? Is anyone else having the same problem? Any workaround?
>
> I once had the same kind of problem - a job stuck in the front of the
> queue preventing other jobs from executing even though checkjob
> reported "can run" for them. In my case, it was due to an
> inconsistency between Maui's and TORQUE's view of the available
> resources: Maui was trying to assign a job to an already occupied
> resource because it thought the jobs running there each used 0
> processors, and TORQUE was rejecting these attempts.
>
> Maybe the output of diagnose -n <name of the offline node>, diagnose
> -j <job id>, diagnose -r will provide additional clues?
# diagnose -n td248.pic.es
diagnosing node table (5120 slots)
Name State Procs Memory Disk Swap Speed Opsys Arch Par Load Res Classes Network Features
td248.pic.es Drained 0:10 8006:8006 1:1 4535:4535 1.00 linux [NONE] DEF 0.00 000 [long_10:10][medium_10:10][sho [DEFAULT] [slc3]
----- --- 0:10 8006:8006 1:1 4535:4535
Total Nodes: 1 (Active: 0 Idle: 0 Down: 1)
]# diagnose -j 3446272
Name State Par Proc QOS WCLimit R Min User Group Account QueuedTime Network Opsys Arch Mem Disk Procs Class Features
3446272 Idle ALL 1 DEF 3:00:00:00 0 1 arnaubri grid - 00:08:27 [NONE] [NONE] [NONE] >=0 >=0 NC0 [slc3:1] [slc3]
]# diagnose -r
Diagnosing Reservations
ResID Type Par StartTime EndTime Duration Node Task Proc
----- ---- --- --------- ------- -------- ---- ---- ----
3435058 Job DEF -8:41:47 15:18:13 1:00:00:00 1 1 1
ACL: JOB==3435058=
CL: JOB==3435058 USER==atlas057 GROUP==atlas CLASS==gshort QOS==lhcatlas DURATION==1:00:00:00 PROC==1
3435059 Job DEF -8:41:35 15:18:25 1:00:00:00 1 1 1
ACL: JOB==3435059=
CL: JOB==3435059 USER==atlas057 GROUP==atlas CLASS==gshort QOS==lhcatlas DURATION==1:00:00:00 PROC==1
3435060 Job DEF -8:40:53 15:19:07 1:00:00:00 1 1 1
ACL: JOB==3435060=
CL: JOB==3435060 USER==atlas057 GROUP==atlas CLASS==gshort QOS==lhcatlas DURATION==1:00:00:00 PROC==1
3435062 Job DEF -8:41:26 15:18:34 1:00:00:00 1 1 1
[...]
Active Reserved Processors: 218
What exactly are you looking for with "-r"?
> Another thing that you might try is setting the node 'down' (kill
> pbs_mom on it) rather than 'offline' to see if it changes anything.
[root@td248 root]# /etc/init.d/pbs_mom stop
Shutting down TORQUE Mom: [ OK ]
# pbsnodes td248.pic.es
td248.pic.es
state = down,offline
np = 10
properties = slc3
ntype = cluster
Not sure if killing pbs_mom did the trick, but now Maui is scheduling
fine, even if I set the worker node to offline only...
]# !pb
pbsnodes td248.pic.es
td248.pic.es
state = offline
np = 10
properties = slc3
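
For anyone repeating the check above, here is a minimal shell sketch for pulling the "state" value out of pbsnodes output and testing whether it contains "down" or "offline". The helper names node_state and state_has are hypothetical (they are not part of TORQUE or Maui), and the parsing assumes the plain "key = value" layout pasted in this message.

```shell
# Hypothetical helpers, not part of TORQUE or Maui.

# Read pbsnodes-style text on stdin and print the value of the first
# "state = ..." line, e.g. "down,offline". Assumes "key = value" lines
# as shown in this thread.
node_state() {
    awk -F' = ' '$1 ~ /^[[:space:]]*state$/ { print $2; exit }'
}

# Succeed if the comma-separated state list in $1 contains the flag in $2.
state_has() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Possible use on a live system (not run here):
#   state=$(pbsnodes td248.pic.es | node_state)
#   state_has "$state" down    && echo "pbs_mom is not reporting"
#   state_has "$state" offline && echo "node is marked offline"
```

With the two outputs pasted above, this would distinguish the "down,offline" state seen right after stopping pbs_mom from the plain "offline" state seen afterwards.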
> Best regards,
> Jan Ploski
Cheers,
Arnau