[Mauiusers] RE: Problem with MAUI and partition (PartitionAccess)
Balle, Susanne
susanne.balle at hp.com
Thu Dec 9 11:46:20 MST 2004
My apologies if you get this twice, Susanne
-----Original Message-----
From: Balle, Susanne
Sent: Thursday, December 09, 2004 1:21 PM
To: 'mauiusers at supercluster.org'
Cc: Balle, Susanne
Subject: Problem with MAUI and partition (PartitionAccess)
Hi
I am trying to use Maui and SLURM.
I have Maui and SLURM running and they seem to exchange some
information.
With MAUI as the scheduler, jobs are detected but never started.
I am running the following job: "srun -n 2 -t 20 ./slurm.sh"
where slurm.sh is
#!/bin/sh
`which hostname`
From the output of checkjob I get the following:
cannot select job 19 for partition DEFAULT (PartitionAccess)
I have enclosed some info below: output from checkjob 19, output from
diagnose -t, tail -110 maui.log, as well as details about how I built
MAUI and integrated it with SLURM. In the output below, job 18 and job 19
are the same job; I terminated job 18 before I had all the
output I needed for this email.
Thanks for any help,
Regards
Susanne
---------------------------------------------------------------
Susanne M. Balle, PhD
Hewlett-Packard
MS ZKO02-3/Q08
110 Spit Brook Road
Nashua, NH 03062
Phone: 603-884-7732
Fax: 603-884-0630
Susanne.Balle at hp.com
-----------------------------------------------------------------------------------
checking job 19
State: Idle
Creds: user:root group:root qos:DEFAULT
WallTime: 00:00:00 of 00:20:00
SubmitTime: Thu Dec 9 18:09:30
(Time Queued Total: 00:00:06 Eligible: 00:00:06)
Total Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 1M Disk >= 1M Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
NodeCount: 1
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 0
PartitionMask: [lsf]
PE: 1.00 StartPriority: 1
cannot select job 19 for partition DEFAULT (PartitionAccess)
[root@xc14n16 log]# diagnose -t
Displaying Partition Status
System Partition Settings: PList: DEFAULT PDef: DEFAULT
Name Procs
DEFAULT 10
Partition  Configured  Up     U/C      Dedicated  D/U    Active  A/U
NODE----------------------------------------------------------------------------
DEFAULT    4           4      100.00%  0          0.00%  0       0.00%
PROC----------------------------------------------------------------------------
DEFAULT    10          10     100.00%  0          0.00%  0       0.00%
MEM-----------------------------------------------------------------------------
DEFAULT    12756       12756  100.00%  0          0.00%  0       0.00%
DISK----------------------------------------------------------------------------
DEFAULT    44716       44716  100.00%  0          0.00%  0       0.00%
Class/Queue State
[<CLASS> <AVAIL>:<UP>]...
DEFAULT [NONE]
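The two outputs above can be cross-checked with a few lines of Python. This is only a minimal sketch: the text excerpts are copied from the checkjob and diagnose -t output, the function names are made up for illustration, and the separators allowed between multiple partition names are an assumption. An empty overlap between the job's PartitionMask and the scheduler's PList would be consistent with the PartitionAccess rejection, but that reading is a guess, not something the output confirms.

```python
import re

# Excerpts copied from the checkjob and diagnose -t output above.
CHECKJOB = """\
Req[0] TaskCount: 1 Partition: ALL
PartitionMask: [lsf]
cannot select job 19 for partition DEFAULT (PartitionAccess)
"""
DIAGNOSE = "System Partition Settings: PList: DEFAULT PDef: DEFAULT"


def job_partition_mask(checkjob_text):
    """Return the partition names inside the PartitionMask brackets."""
    match = re.search(r"PartitionMask:\s*\[([^\]]*)\]", checkjob_text)
    if not match:
        return set()
    # Assumption: multiple names would be separated by spaces, ':' or ','.
    return set(re.split(r"[\s:,]+", match.group(1).strip())) - {""}


def scheduler_partitions(diagnose_text):
    """Return the partition names listed after PList: in diagnose -t."""
    match = re.search(r"PList:\s*(.*?)\s+PDef:", diagnose_text)
    return set(match.group(1).split()) if match else set()


mask = job_partition_mask(CHECKJOB)
plist = scheduler_partitions(DIAGNOSE)
print("job partition mask:  ", sorted(mask))
print("scheduler partitions:", sorted(plist))
print("overlap:             ", sorted(mask & plist))
```

Here the job carries the mask "lsf" while the scheduler only knows the partition DEFAULT, so the overlap is empty.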
tail -110 maui.log gives me the following.
12/09 17:55:57 INFO: starting iteration 229
12/09 17:55:57 MRMGetInfo()
12/09 17:55:57 MClusterClearUsage()
12/09 17:55:57 MRMClusterQuery()
12/09 17:55:57 MWikiClusterLoadInfo(XC14N16,RCount,EMsg,SC)
12/09 17:55:57 MWikiDoCommand(XC14N16,7321,9000000,CHECKSUM,CMD=GETNODES ARG=0:ALL,Data,DataSize,SC)
12/09 17:55:57 MSUSendData(S,9000000,TRUE,FALSE)
12/09 17:55:57 INFO: packet sent (78 bytes of 78)
12/09 17:55:57 INFO: command sent to server
12/09 17:55:57 INFO: message sent: 'CMD=GETNODES ARG=0:ALL'
12/09 17:55:57 MSURecvData(S,9000000,1)
12/09 17:55:57 MSURecvPacket(8,Buffer,9,NULL,9000000)
12/09 17:55:57 MSURecvPacket(8,Buffer,269,NULL,9000000)
12/09 17:55:57 MSUDisconnect(S)
12/09 17:55:57 INFO: received node list through WIKI RM
12/09 17:55:57 INFO: loading 4 node(s)
12/09 17:55:57 MWikiNodeUpdate(AList,xc14n13)
12/09 17:55:57 MWikiNodeUpdate(AList,xc14n14)
12/09 17:55:57 MWikiNodeUpdate(AList,xc14n15)
12/09 17:55:57 MWikiNodeUpdate(AList,xc14n16)
12/09 17:55:57 INFO: 0 WIKI resources detected on RM XC14N16
12/09 17:55:57 WARNING: no resources detected
12/09 17:55:57 MRMWorkloadQuery()
12/09 17:55:57 MWikiWorkloadQuery(XC14N16,JCount,SC)
12/09 17:55:57 MWikiDoCommand(XC14N16,7321,9000000,CHECKSUM,CMD=GETJOBS ARG=0:ALL,Data,DataSize,SC)
12/09 17:55:57 MSUSendData(S,9000000,TRUE,FALSE)
12/09 17:55:57 INFO: packet sent (77 bytes of 77)
12/09 17:55:57 INFO: command sent to server
12/09 17:55:57 INFO: message sent: 'CMD=GETJOBS ARG=0:ALL'
12/09 17:55:57 MSURecvData(S,9000000,1)
12/09 17:55:57 MSURecvPacket(8,Buffer,9,NULL,9000000)
12/09 17:55:57 MSURecvPacket(8,Buffer,200,NULL,9000000)
12/09 17:55:57 MSUDisconnect(S)
12/09 17:55:57 INFO: received job list through WIKI RM
12/09 17:55:57 INFO: loading 1 job(s)
12/09 17:55:57 MWikiUpdateJob(AList,18,0)
12/09 17:55:57 MUGetIndex(UPDATETIME=1102632406,ValList,0)
12/09 17:55:57 MUGetIndex(STATE=Idle,ValList,0)
12/09 17:55:57 MUGetIndex(WCLIMIT=1200,ValList,0)
12/09 17:55:57 MUGetIndex(TASKS=1,ValList,0)
12/09 17:55:57 MUGetIndex(QUEUETIME=1102632406,ValList,0)
12/09 17:55:57 MUGetIndex(UNAME=root,ValList,0)
12/09 17:55:57 MUGetIndex(GNAME=root,ValList,0)
12/09 17:55:57 MUGetIndex(PARTITIONMASK=lsf,ValList,0)
12/09 17:55:57 MUGetIndex(NODES=1,ValList,0)
12/09 17:55:57 MUGetIndex(RMEM=1,ValList,0)
12/09 17:55:57 MUGetIndex(RDISK=1,ValList,0)
12/09 17:55:57 INFO: 1 WIKI jobs detected on RM XC14N16
12/09 17:55:57 INFO: jobs detected: 1
12/09 17:55:57 MStatClearUsage(node,Active)
12/09 17:55:57 MClusterUpdateNodeState()
12/09 17:55:57 MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg)
12/09 17:55:57 INFO: job '18' Priority: 9
12/09 17:55:57 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 9(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
12/09 17:55:57 MStatClearUsage([NONE],Active)
12/09 17:55:57 INFO: total jobs selected (ALL): 1/1
12/09 17:55:57 MQueueSelectAllJobs(Q,SOFT,ALL,JIList,DP,Msg)
12/09 17:55:57 INFO: job '18' Priority: 9
12/09 17:55:57 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 9(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
12/09 17:55:57 MStatClearUsage([NONE],Idle)
12/09 17:55:57 INFO: total jobs selected (ALL): 1/1
12/09 17:55:57 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,FALSE)
12/09 17:55:57 INFO: total jobs selected in partition ALL: 1/1
12/09 17:55:57 MQueueScheduleRJobs(Q)
12/09 17:55:57 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
12/09 17:55:57 INFO: total jobs selected in partition ALL: 1/1
12/09 17:55:57 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)
12/09 17:55:57 INFO: total jobs selected in partition DEFAULT: 0/1 [PartitionAccess: 1]
12/09 17:55:57 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,TRUE)
12/09 17:55:57 INFO: total jobs selected in partition ALL: 1/1
12/09 17:55:57 MQueueScheduleRJobs(Q)
12/09 17:55:57 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
12/09 17:55:57 INFO: total jobs selected in partition ALL: 1/1
12/09 17:55:57 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)
12/09 17:55:57 INFO: total jobs selected in partition DEFAULT: 0/1 [PartitionAccess: 1]
12/09 17:55:57 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,TRUE)
12/09 17:55:57 INFO: total jobs selected in partition ALL: 1/1
12/09 17:55:57 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,DEFAULT,FReason,TRUE)
12/09 17:55:57 INFO: total jobs selected in partition DEFAULT: 0/1 [PartitionAccess: 1]
12/09 17:55:57 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
12/09 17:55:57 INFO: total jobs selected in partition ALL: 1/1
12/09 17:55:57 INFO: job '18' Priority: 9
12/09 17:55:57 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 9(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
12/09 17:55:57 MSchedUpdateStats()
12/09 17:55:57 INFO: iteration: 229 scheduling time: 0.002 seconds
12/09 17:55:57 MResUpdateStats()
12/09 17:55:57 INFO: current util[229]: 0/4 (0.00%) PH: 0.00% active jobs: 0 of 2 (completed: 0)
12/09 17:55:57 MQueueCheckStatus()
12/09 17:55:57 MNodeCheckStatus()
12/09 17:55:57 MUClearChild(PID)
12/09 17:55:57 INFO: scheduling complete. sleeping 20 seconds
[root@xc14n16 log]#
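A log of this size is easier to read once it is reduced to the lines that explain why a job was not selected. The following is a minimal sketch, not part of MAUI: the function names and the regular expression are illustrative, and the sample text is a handful of lines copied from the maui.log excerpt above. It tolerates the bracketed reason tag appearing either on the same line as the "total jobs selected" message or on the following line.

```python
import re
from collections import Counter

# A few lines copied from the maui.log excerpt above.
LOG = """\
12/09 17:55:57 INFO: 0 WIKI resources detected on RM XC14N16
12/09 17:55:57 WARNING: no resources detected
12/09 17:55:57 INFO: total jobs selected in partition ALL: 1/1
12/09 17:55:57 INFO: total jobs selected in partition DEFAULT: 0/1
[PartitionAccess: 1]
"""

# Matches "total jobs selected in partition NAME: n/m", optionally followed
# (after any whitespace, including a line break) by "[Reason: count]".
SELECTED = re.compile(
    r"total jobs selected in partition (\S+): (\d+)/(\d+)"
    r"\s*(?:\[(\w+): (\d+)\])?"
)


def rejection_reasons(log_text):
    """Count rejected jobs per (partition, reason) pair."""
    reasons = Counter()
    for partition, _selected, _total, reason, count in SELECTED.findall(log_text):
        if reason:
            reasons[(partition, reason)] += int(count)
    return reasons


def warnings(log_text):
    """Return the message text of every WARNING line."""
    return [
        line.split("WARNING:", 1)[1].strip()
        for line in log_text.splitlines()
        if "WARNING:" in line
    ]


print(dict(rejection_reasons(LOG)))
print(warnings(LOG))
```

On the sample, the only rejection recorded is PartitionAccess in partition DEFAULT, and the only warning is "no resources detected".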
Thanks for any help,
Regards
Susanne
-------------------------------------------------------------------------------
For details about how I built MAUI and integrated it with SLURM see
section below.
I downloaded the MAUI kit: maui-3.2.6p9 from the MAUI website and
compiled MAUI from its source distribution. I tried to follow the steps
located at http://www.llnl.gov/linux/slurm/maui.html
The configuration step asked me neither whether I wanted to build MAUI
with PBS nor for a checksum seed, although the SLURM integration
document says it should.
Reading further down in the SLURM integration instructions I noticed that
SLURM will be using the Wiki interface to MAUI.
From the doc it looks like my configure line should look something like:
./configure --with-key=42 --with-wiki
Completed as expected
gmake
Completed as expected
Next I updated the MAUI configuration file, maui.cfg, with the following
info:
# Resource Manager Definition
RMCFG[XC14N16] TYPE=WIKI
RMPORT 7321
RMHOST XC14N16
RMPOLLINTERVAL 00:00:20
In /hptc_cluster/slurm/etc/slurm.conf
I uncommented the following lines:
SchedulerType=sched/wiki
SchedulerAuth=42
SchedulerPort=7321
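Since the two sides are configured in different files, a quick consistency check can rule out a simple mismatch. The sketch below is hypothetical helper code, not part of MAUI or SLURM. It assumes that the port in maui.cfg (RMPORT) has to equal SchedulerPort, that the key passed to configure (--with-key) has to equal SchedulerAuth, and that a WIKI resource manager pairs with sched/wiki. The settings are copied from the fragments above.

```python
# Settings copied from the maui.cfg and slurm.conf fragments above.
MAUI_CFG = """\
# Resource Manager Definition
RMCFG[XC14N16] TYPE=WIKI
RMPORT 7321
RMHOST XC14N16
RMPOLLINTERVAL 00:00:20
"""
SLURM_CONF = """\
SchedulerType=sched/wiki
SchedulerAuth=42
SchedulerPort=7321
"""
CONFIGURE_KEY = "42"  # from: ./configure --with-key=42 --with-wiki


def parse_maui_cfg(text):
    """Parse 'KEY value' lines, skipping comments and blank lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return settings


def parse_slurm_conf(text):
    """Parse 'Key=value' lines, ignoring anything after '#'."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" in line:
            key, _, value = line.partition("=")
            settings[key.strip()] = value.strip()
    return settings


def find_mismatches(maui, slurm, configure_key):
    """Return a list of human-readable mismatches (empty if none)."""
    problems = []
    if maui.get("RMPORT") != slurm.get("SchedulerPort"):
        problems.append("RMPORT and SchedulerPort differ")
    if configure_key != slurm.get("SchedulerAuth"):
        problems.append("configure key and SchedulerAuth differ")
    rm_types = [v for k, v in maui.items() if k.startswith("RMCFG[")]
    if "TYPE=WIKI" not in rm_types or slurm.get("SchedulerType") != "sched/wiki":
        problems.append("resource manager type and SchedulerType do not pair")
    return problems


problems = find_mismatches(
    parse_maui_cfg(MAUI_CFG), parse_slurm_conf(SLURM_CONF), CONFIGURE_KEY
)
print(problems or "port, key and type agree")
```

With the values shown, the port, key and interface type agree on both sides, so under these assumptions the cause of the problem would have to lie elsewhere.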
I started maui and slurm and some of the commands work.
[root@xc14n16 log]# showq
ACTIVE JOBS--------------------
JOBNAME  USERNAME  STATE  PROC  REMAINING  STARTTIME
0 Active Jobs   0 of 10 Processors Active (0.00%)
                0 of 4 Nodes Active (0.00%)
IDLE JOBS----------------------
JOBNAME  USERNAME  STATE  PROC  WCLIMIT   QUEUETIME
18       root      Idle   1     00:20:00  Thu Dec 9 17:46:46
1 Idle Job
BLOCKED JOBS----------------
JOBNAME  USERNAME  STATE  PROC  WCLIMIT   QUEUETIME
Total Jobs: 1 Active Jobs: 0 Idle Jobs: 1 Blocked Jobs: 0
[root@xc14n16 log]#