[Mauiusers] RE: Problem with MAUI and partition (PartitionAccess)

Balle, Susanne susanne.balle at hp.com
Thu Dec 9 11:46:20 MST 2004


My apologies if you get this twice, Susanne

-----Original Message-----
From: Balle, Susanne 
Sent: Thursday, December 09, 2004 1:21 PM
To: 'mauiusers at supercluster.org'
Cc: Balle, Susanne
Subject: Problem with MAUI and partition (PartitionAccess)



Hi

I am trying to use Maui and SLURM.

I have Maui and SLURM running and they seem to exchange some
information.

When using MAUI as the scheduler, jobs are detected but never started.
I am running the following job: "srun -n 2 -t 20 ./slurm.sh"
where slurm.sh is:
#!/bin/sh
`which hostname`

From the output of checkjob I get the following:
cannot select job 19 for partition DEFAULT (PartitionAccess)

I have enclosed some info below: output from checkjob 19, output from
diagnose -t, tail -110 maui.log, and details about how I built MAUI and
integrated it with SLURM. In the output below, job 18 and job 19 are the
same job; job 18 was terminated before I had collected all the output I
needed for this email.

Thanks for any help,

Regards

Susanne

---------------------------------------------------------------
Susanne M. Balle, PhD
Hewlett-Packard
MS ZKO02-3/Q08
110 Spit Brook Road
Nashua, NH 03062

Phone: 603-884-7732
Fax:     603-884-0630

Susanne.Balle at hp.com

------------------------------------------------------------------------

checking job 19

State: Idle
Creds:  user:root  group:root  qos:DEFAULT
WallTime: 00:00:00 of 00:20:00
SubmitTime: Thu Dec  9 18:09:30
  (Time Queued  Total: 00:00:06  Eligible: 00:00:06)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 1M  Disk >= 1M  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
NodeCount: 1

IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [lsf]
PE:  1.00  StartPriority:  1

cannot select job 19 for partition DEFAULT (PartitionAccess)

[root@xc14n16 log]# diagnose -t
Displaying Partition Status

System Partition Settings:  PList: DEFAULT PDef: DEFAULT

Name                    Procs

DEFAULT                    10

Partition    Configured         Up     U/C  Dedicated     D/U     Active     A/U

NODE----------------------------------------------------------------------------
DEFAULT               4          4 100.00%          0   0.00%          0   0.00%
PROC----------------------------------------------------------------------------
DEFAULT              10         10 100.00%          0   0.00%          0   0.00%
MEM-----------------------------------------------------------------------------
DEFAULT           12756      12756 100.00%          0   0.00%          0   0.00%
DISK----------------------------------------------------------------------------
DEFAULT           44716      44716 100.00%          0   0.00%          0   0.00%

Class/Queue State

             [<CLASS> <AVAIL>:<UP>]...

     DEFAULT [NONE]
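
One way to read the two outputs above side by side is to compare the job's PartitionMask with the partition list Maui reports. The following is a minimal sketch, assuming that a PartitionAccess rejection means the job's mask shares no name with the scheduler's PList and that names are compared case-sensitively; both points are assumptions, not confirmed Maui behavior. The helper names and the ":" / "," separators are hypothetical, and the input strings are copied from the checkjob and diagnose -t output.

```python
import re

# Excerpts copied from the checkjob 19 and diagnose -t output above.
CHECKJOB_EXCERPT = """\
Req[0]  TaskCount: 1  Partition: ALL
PartitionMask: [lsf]
"""
DIAGNOSE_EXCERPT = "System Partition Settings:  PList: DEFAULT PDef: DEFAULT"


def job_partition_mask(checkjob_text):
    """Return the partition names listed in the PartitionMask line."""
    match = re.search(r"PartitionMask:\s*\[([^\]]*)\]", checkjob_text)
    if match is None:
        return set()
    # Assumption: multiple names, if any, are separated by ':' ',' or spaces.
    return {name for name in re.split(r"[:,\s]+", match.group(1)) if name}


def scheduler_partitions(diagnose_text):
    """Return the partition names listed after PList: in diagnose -t."""
    match = re.search(r"PList:\s*(.*?)\s+PDef:", diagnose_text)
    if match is None:
        return set()
    return set(match.group(1).split())


mask = job_partition_mask(CHECKJOB_EXCERPT)
known = scheduler_partitions(DIAGNOSE_EXCERPT)
common = mask & known

print("job partition mask:  ", sorted(mask))
print("scheduler partitions:", sorted(known))
print("partitions in common:", sorted(common))
```

With the values above, the job asks for "lsf" while Maui only lists "DEFAULT", so the two sets have nothing in common. Whether that mismatch is the actual cause of the rejection is not established by this output alone.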

tail -110 maui.log gives me the following.

12/09 17:55:57 INFO:     starting iteration 229
12/09 17:55:57 MRMGetInfo()
12/09 17:55:57 MClusterClearUsage()
12/09 17:55:57 MRMClusterQuery()
12/09 17:55:57 MWikiClusterLoadInfo(XC14N16,RCount,EMsg,SC)
12/09 17:55:57 MWikiDoCommand(XC14N16,7321,9000000,CHECKSUM,CMD=GETNODES ARG=0:ALL,Data,DataSize,SC)
12/09 17:55:57 MSUSendData(S,9000000,TRUE,FALSE)
12/09 17:55:57 INFO:     packet sent (78 bytes of 78)
12/09 17:55:57 INFO:     command sent to server
12/09 17:55:57 INFO:     message sent: 'CMD=GETNODES ARG=0:ALL'
12/09 17:55:57 MSURecvData(S,9000000,1)
12/09 17:55:57 MSURecvPacket(8,Buffer,9,NULL,9000000)
12/09 17:55:57 MSURecvPacket(8,Buffer,269,NULL,9000000)
12/09 17:55:57 MSUDisconnect(S)
12/09 17:55:57 INFO:     received node list through WIKI RM
12/09 17:55:57 INFO:     loading 4 node(s)
12/09 17:55:57 MWikiNodeUpdate(AList,xc14n13)
12/09 17:55:57 MWikiNodeUpdate(AList,xc14n14)
12/09 17:55:57 MWikiNodeUpdate(AList,xc14n15)
12/09 17:55:57 MWikiNodeUpdate(AList,xc14n16)
12/09 17:55:57 INFO:     0 WIKI resources detected on RM XC14N16
12/09 17:55:57 WARNING:  no resources detected
12/09 17:55:57 MRMWorkloadQuery()
12/09 17:55:57 MWikiWorkloadQuery(XC14N16,JCount,SC)
12/09 17:55:57 MWikiDoCommand(XC14N16,7321,9000000,CHECKSUM,CMD=GETJOBS ARG=0:ALL,Data,DataSize,SC)
12/09 17:55:57 MSUSendData(S,9000000,TRUE,FALSE)
12/09 17:55:57 INFO:     packet sent (77 bytes of 77)
12/09 17:55:57 INFO:     command sent to server
12/09 17:55:57 INFO:     message sent: 'CMD=GETJOBS ARG=0:ALL'
12/09 17:55:57 MSURecvData(S,9000000,1)
12/09 17:55:57 MSURecvPacket(8,Buffer,9,NULL,9000000)
12/09 17:55:57 MSURecvPacket(8,Buffer,200,NULL,9000000)
12/09 17:55:57 MSUDisconnect(S)
12/09 17:55:57 INFO:     received job list through WIKI RM
12/09 17:55:57 INFO:     loading 1 job(s)
12/09 17:55:57 MWikiUpdateJob(AList,18,0)
12/09 17:55:57 MUGetIndex(UPDATETIME=1102632406,ValList,0)
12/09 17:55:57 MUGetIndex(STATE=Idle,ValList,0)
12/09 17:55:57 MUGetIndex(WCLIMIT=1200,ValList,0)
12/09 17:55:57 MUGetIndex(TASKS=1,ValList,0)
12/09 17:55:57 MUGetIndex(QUEUETIME=1102632406,ValList,0)
12/09 17:55:57 MUGetIndex(UNAME=root,ValList,0)
12/09 17:55:57 MUGetIndex(GNAME=root,ValList,0)
12/09 17:55:57 MUGetIndex(PARTITIONMASK=lsf,ValList,0)
12/09 17:55:57 MUGetIndex(NODES=1,ValList,0)
12/09 17:55:57 MUGetIndex(RMEM=1,ValList,0)
12/09 17:55:57 MUGetIndex(RDISK=1,ValList,0)
12/09 17:55:57 INFO:     1 WIKI jobs detected on RM XC14N16
12/09 17:55:57 INFO:     jobs detected: 1
12/09 17:55:57 MStatClearUsage(node,Active)
12/09 17:55:57 MClusterUpdateNodeState()
12/09 17:55:57 MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg)
12/09 17:55:57 INFO:     job '18' Priority:        9
12/09 17:55:57 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:      0(00.0)  Serv:      9(00.0)
Targ:      0(00.0)  Res:      0(00.0)  Us:      0(00.0)
12/09 17:55:57 MStatClearUsage([NONE],Active)
12/09 17:55:57 INFO:     total jobs selected (ALL): 1/1
12/09 17:55:57 MQueueSelectAllJobs(Q,SOFT,ALL,JIList,DP,Msg)
12/09 17:55:57 INFO:     job '18' Priority:        9
12/09 17:55:57 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:      0(00.0)  Serv:      9(00.0)
Targ:      0(00.0)  Res:      0(00.0)  Us:      0(00.0)
12/09 17:55:57 MStatClearUsage([NONE],Idle)
12/09 17:55:57 INFO:     total jobs selected (ALL): 1/1
12/09 17:55:57 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,FALSE)
12/09 17:55:57 INFO:     total jobs selected in partition ALL: 1/1
12/09 17:55:57 MQueueScheduleRJobs(Q)
12/09 17:55:57 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
12/09 17:55:57 INFO:     total jobs selected in partition ALL: 1/1
12/09 17:55:57 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)
12/09 17:55:57 INFO:     total jobs selected in partition DEFAULT: 0/1 [PartitionAccess: 1]
12/09 17:55:57 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,TRUE)
12/09 17:55:57 INFO:     total jobs selected in partition ALL: 1/1
12/09 17:55:57 MQueueScheduleRJobs(Q)
12/09 17:55:57 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
12/09 17:55:57 INFO:     total jobs selected in partition ALL: 1/1
12/09 17:55:57 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)
12/09 17:55:57 INFO:     total jobs selected in partition DEFAULT: 0/1 [PartitionAccess: 1]
12/09 17:55:57 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,TRUE)
12/09 17:55:57 INFO:     total jobs selected in partition ALL: 1/1
12/09 17:55:57 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,DEFAULT,FReason,TRUE)
12/09 17:55:57 INFO:     total jobs selected in partition DEFAULT: 0/1 [PartitionAccess: 1]
12/09 17:55:57 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
12/09 17:55:57 INFO:     total jobs selected in partition ALL: 1/1
12/09 17:55:57 INFO:     job '18' Priority:        9
12/09 17:55:57 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:      0(00.0)  Serv:      9(00.0)
Targ:      0(00.0)  Res:      0(00.0)  Us:      0(00.0)
12/09 17:55:57 MSchedUpdateStats()
12/09 17:55:57 INFO:     iteration:  229   scheduling time:  0.002 seconds
12/09 17:55:57 MResUpdateStats()
12/09 17:55:57 INFO:     current util[229]:  0/4 (0.00%)  PH: 0.00%  active jobs: 0 of 2 (completed: 0)
12/09 17:55:57 MQueueCheckStatus()
12/09 17:55:57 MNodeCheckStatus()
12/09 17:55:57 MUClearChild(PID)
12/09 17:55:57 INFO:     scheduling complete.  sleeping 20 seconds
[root@xc14n16 log]#
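
The job attributes that Maui receives over the Wiki interface appear in the MUGetIndex lines of the log. The following is a minimal sketch that pulls those KEY=VALUE pairs out of a log excerpt and converts the numeric fields to readable form. It assumes that UPDATETIME and QUEUETIME are Unix epoch seconds and that WCLIMIT is in seconds; the log does not state either unit. The helper names are hypothetical.

```python
import re
from datetime import datetime, timezone

# Lines copied from the maui.log excerpt above.
LOG_EXCERPT = """\
12/09 17:55:57 MUGetIndex(UPDATETIME=1102632406,ValList,0)
12/09 17:55:57 MUGetIndex(STATE=Idle,ValList,0)
12/09 17:55:57 MUGetIndex(WCLIMIT=1200,ValList,0)
12/09 17:55:57 MUGetIndex(TASKS=1,ValList,0)
12/09 17:55:57 MUGetIndex(QUEUETIME=1102632406,ValList,0)
12/09 17:55:57 MUGetIndex(PARTITIONMASK=lsf,ValList,0)
"""

# Matches the KEY=VALUE argument at the start of each MUGetIndex call.
FIELD = re.compile(r"MUGetIndex\((\w+)=([^,]*),")


def job_fields(log_text):
    """Collect the KEY=VALUE pairs logged by MUGetIndex into a dict."""
    return dict(FIELD.findall(log_text))


def hms(seconds):
    """Format a number of seconds as HH:MM:SS."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


fields = job_fields(LOG_EXCERPT)
# Assumption: QUEUETIME is seconds since the Unix epoch.
queued_utc = datetime.fromtimestamp(int(fields["QUEUETIME"]), tz=timezone.utc)

print("wall clock limit:", hms(fields["WCLIMIT"]))   # 00:20:00
print("queued (UTC):    ", queued_utc.isoformat())   # 2004-12-09T22:46:46+00:00
print("partition mask:  ", fields["PARTITIONMASK"])  # lsf
```

Under those assumptions, the 1200-second limit corresponds to the "-t 20" given to srun, and 22:46:46 UTC agrees with the 17:46:46 queue time that showq reports for job 18 if the host clock is five hours behind UTC.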

------------------------------------------------------------------------

For details about how I built MAUI and integrated it with SLURM, see
the section below.

I downloaded the MAUI kit: maui-3.2.6p9 from the MAUI website and
compiled MAUI from its source distribution. I tried to follow the steps
located at http://www.llnl.gov/linux/slurm/maui.html

The configuration step didn't ask me whether I wanted to build MAUI with
PBS, and it didn't ask me for a checksum seed either, although the SLURM
integration document says it should.

Reading further down in the SLURM integration instructions, I noticed
that SLURM uses the Wiki interface to MAUI.

From the doc, it looks like my configure line should look something like:

./configure --with-key=42 --with-wiki

Completed as expected

gmake

Completed as expected

Next I updated the MAUI configuration file, maui.cfg, with the following
info:

# Resource Manager Definition

RMCFG[XC14N16] TYPE=WIKI
RMPORT          7321
RMHOST          XC14N16
RMPOLLINTERVAL  00:00:20

In /hptc_cluster/slurm/etc/slurm.conf I uncommented the following lines:
SchedulerType=sched/wiki
SchedulerAuth=42
SchedulerPort=7321
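
The settings above have to agree across three places: the configure line, maui.cfg, and slurm.conf. The following is a minimal sketch that cross-checks the values quoted in this message. It assumes that the --with-key value must equal SchedulerAuth and that RMPORT must equal SchedulerPort; the parsers are simplified illustrations with hypothetical names, not the real Maui or SLURM config readers.

```python
# Values copied from this message.
MAUI_CFG = """\
RMCFG[XC14N16] TYPE=WIKI
RMPORT          7321
RMHOST          XC14N16
RMPOLLINTERVAL  00:00:20
"""

SLURM_CONF = """\
SchedulerType=sched/wiki
SchedulerAuth=42
SchedulerPort=7321
"""

CONFIGURE_KEY = "42"  # from: ./configure --with-key=42 --with-wiki


def parse_maui(text):
    """Parse 'NAME value' lines, ignoring blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        parts = line.split(None, 1)
        if len(parts) == 2:
            settings[parts[0]] = parts[1].strip()
    return settings


def parse_slurm(text):
    """Parse 'Name=value' lines, ignoring blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


maui = parse_maui(MAUI_CFG)
slurm = parse_slurm(SLURM_CONF)

checks = {
    "port matches": maui.get("RMPORT") == slurm.get("SchedulerPort"),
    "key matches": CONFIGURE_KEY == slurm.get("SchedulerAuth"),
    "wiki on both sides": maui.get("RMCFG[XC14N16]") == "TYPE=WIKI"
    and slurm.get("SchedulerType") == "sched/wiki",
}

for name, ok in checks.items():
    print(f"{name}: {'ok' if ok else 'MISMATCH'}")
```

All three checks pass for the values shown here, which is consistent with the log showing that Maui and SLURM do exchange node and job lists on port 7321.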

I started maui and slurm, and some of the commands work.

[root@xc14n16 log]# showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME


     0 Active Jobs       0 of   10 Processors Active (0.00%)
                         0 of    4 Nodes Active      (0.00%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

18                     root       Idle     1    00:20:00  Thu Dec  9 17:46:46

1 Idle Job

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


Total Jobs: 1   Active Jobs: 0   Idle Jobs: 1   Blocked Jobs: 0
[root@xc14n16 log]#



