[Mauiusers] RE: Problem with MAUI and partition(PartitionAccess)

Balle, Susanne susanne.balle at hp.com
Mon Dec 13 04:05:43 MST 2004


Dave,

Enclosed find the maui.cfg file.

Thanks for helping me out,

Susanne

-----Original Message-----
From: Dave Jackson [mailto:jacksond at clusterresources.com] 
Sent: Friday, December 10, 2004 4:01 PM
To: Balle, Susanne; mauiusers at supercluster.org
Subject: Re: [Mauiusers] RE: Problem with MAUI and partition(PartitionAccess)


Susanne,

  The partition access failure indicates that the credentials associated
with the job do not have access to the resources inside the partition. 
This access should be enabled by default and should only be denied if an
explicit configuration limiting partition access is specified in the
config file.  Can you send us your maui.cfg file?

Dave
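
As a rough illustration of the check described above, the following Python
sketch scans a maui.cfg for settings that could limit partition access. It
assumes that such limits are expressed as PLIST= or PDEF= attributes on
credential lines (for example USERCFG or GROUPCFG); the sample fragment is
hypothetical and is not the poster's actual file.

```python
import re

# Assumption: partition limits appear as PLIST=... or PDEF=... attributes.
RESTRICTING = re.compile(r"\b(PLIST|PDEF)=(\S+)", re.IGNORECASE)

def partition_settings(cfg_text):
    """Return (line_number, attribute, value) for each PLIST=/PDEF= setting."""
    found = []
    for number, line in enumerate(cfg_text.splitlines(), start=1):
        active = line.split("#", 1)[0]  # ignore trailing comments
        for attribute, value in RESTRICTING.findall(active):
            found.append((number, attribute.upper(), value))
    return found

# Hypothetical fragment used only to exercise the function.
sample_cfg = """# hypothetical fragment, not the poster's file
RMCFG[XC14N16] TYPE=WIKI
USERCFG[DEFAULT] PLIST=DEFAULT PDEF=DEFAULT
"""

for number, attribute, value in partition_settings(sample_cfg):
    print(f"line {number}: {attribute}={value}")
```

An empty result would be consistent with no explicit partition limit being
configured, although it does not rule out other causes.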

>>> "Balle, Susanne" <susanne.balle at hp.com> 12/09/04 11:46 AM >>>

My apologies if you get this twice, Susanne

-----Original Message-----
From: Balle, Susanne 
Sent: Thursday, December 09, 2004 1:21 PM
To: 'mauiusers at supercluster.org'
Cc: Balle, Susanne
Subject: Problem with MAUI and partition (PartitionAccess)



Hi

I am trying to use Maui and SLURM.

I have Maui and SLURM running and they seem to exchange some
information.

When using MAUI as the scheduler, jobs are detected but never started.
I am running the following job:

  srun -n 2 -t 20 ./slurm.sh

where slurm.sh is

  #!/bin/sh
  `which hostname`

From the output of checkjob I get the following:
cannot select job 19 for partition DEFAULT (PartitionAccess)

I have enclosed some info below: output from checkjob 19, output from
diagnose -t, tail -110 maui.log, as well as details about how I built
MAUI and integrated it with SLURM. In the output below, job 18 and job 19
are the same job. Job 18 was simply terminated before I had all the
output I needed for this email.

Thanks for any help,

Regards

Susanne

---------------------------------------------------------------
Susanne M. Balle, PhD
Hewlett-Packard
MS ZKO02-3/Q08
110 Spit Brook Road
Nashua, NH 03062

Phone: 603-884-7732
Fax:     603-884-0630

Susanne.Balle at hp.com

-----------------------------------------------------------------------------------

checking job 19

State: Idle
Creds:  user:root  group:root  qos:DEFAULT
WallTime: 00:00:00 of 00:20:00
SubmitTime: Thu Dec  9 18:09:30
  (Time Queued  Total: 00:00:06  Eligible: 00:00:06)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 1M  Disk >= 1M  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
NodeCount: 1

IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [lsf]
PE:  1.00  StartPriority:  1

cannot select job 19 for partition DEFAULT (PartitionAccess)

[root at xc14n16 log]# diagnose -t
Displaying Partition Status

System Partition Settings:  PList: DEFAULT PDef: DEFAULT

Name                    Procs

DEFAULT                    10

Partition    Configured         Up     U/C  Dedicated     D/U     Active     A/U

NODE----------------------------------------------------------------------------
DEFAULT               4          4 100.00%          0   0.00%          0   0.00%
PROC----------------------------------------------------------------------------
DEFAULT              10         10 100.00%          0   0.00%          0   0.00%
MEM-----------------------------------------------------------------------------
DEFAULT           12756      12756 100.00%          0   0.00%          0   0.00%
DISK----------------------------------------------------------------------------
DEFAULT           44716      44716 100.00%          0   0.00%          0   0.00%

Class/Queue State

             [<CLASS> <AVAIL>:<UP>]...

     DEFAULT [NONE]
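
As an aid for reading the two outputs above side by side, here is a minimal
Python sketch that extracts the job's PartitionMask from checkjob output and
the system partition list (PList) from diagnose -t output, then prints their
overlap. The function names are made up for illustration, and the separators
accepted inside the brackets (colon, comma, whitespace) are an assumption.
Whether an empty overlap is what triggers PartitionAccess here is not
confirmed anywhere in this thread.

```python
import re

def job_partition_mask(checkjob_text):
    """Return the names listed in a checkjob 'PartitionMask: [...]' line."""
    match = re.search(r"PartitionMask:\s*\[([^\]]*)\]", checkjob_text)
    if match is None:
        return set()
    # Assumption: multiple names are separated by ':', ',' or whitespace.
    return {name for name in re.split(r"[:,\s]+", match.group(1)) if name}

def system_partitions(diagnose_text):
    """Return the names in the 'PList:' field of diagnose -t output."""
    match = re.search(r"PList:\s*(.*?)\s+PDef:", diagnose_text)
    if match is None:
        return set()
    return {name for name in re.split(r"[:,\s]+", match.group(1)) if name}

# Lines copied from the outputs quoted in this message.
checkjob_output = (
    "Bypass: 0  StartCount: 0\n"
    "PartitionMask: [lsf]\n"
    "PE:  1.00  StartPriority:  1\n"
)
diagnose_output = "System Partition Settings:  PList: DEFAULT PDef: DEFAULT\n"

requested = job_partition_mask(checkjob_output)
available = system_partitions(diagnose_output)
print("job partition mask:      ", sorted(requested))
print("partitions known to Maui:", sorted(available))
print("overlap:                 ", sorted(requested & available))
```

With the lines quoted above, the mask contains only lsf while the system list
contains only DEFAULT, so the overlap is empty.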

tail -110 maui.log gives me the following.

12/09 17:55:57 INFO:     starting iteration 229
12/09 17:55:57 MRMGetInfo()
12/09 17:55:57 MClusterClearUsage()
12/09 17:55:57 MRMClusterQuery()
12/09 17:55:57 MWikiClusterLoadInfo(XC14N16,RCount,EMsg,SC)
12/09 17:55:57 MWikiDoCommand(XC14N16,7321,9000000,CHECKSUM,CMD=GETNODES ARG=0:ALL,Data,DataSize,SC)
12/09 17:55:57 MSUSendData(S,9000000,TRUE,FALSE)
12/09 17:55:57 INFO:     packet sent (78 bytes of 78)
12/09 17:55:57 INFO:     command sent to server
12/09 17:55:57 INFO:     message sent: 'CMD=GETNODES ARG=0:ALL'
12/09 17:55:57 MSURecvData(S,9000000,1)
12/09 17:55:57 MSURecvPacket(8,Buffer,9,NULL,9000000)
12/09 17:55:57 MSURecvPacket(8,Buffer,269,NULL,9000000)
12/09 17:55:57 MSUDisconnect(S)
12/09 17:55:57 INFO:     received node list through WIKI RM
12/09 17:55:57 INFO:     loading 4 node(s)
12/09 17:55:57 MWikiNodeUpdate(AList,xc14n13)
12/09 17:55:57 MWikiNodeUpdate(AList,xc14n14)
12/09 17:55:57 MWikiNodeUpdate(AList,xc14n15)
12/09 17:55:57 MWikiNodeUpdate(AList,xc14n16)
12/09 17:55:57 INFO:     0 WIKI resources detected on RM XC14N16
12/09 17:55:57 WARNING:  no resources detected
12/09 17:55:57 MRMWorkloadQuery()
12/09 17:55:57 MWikiWorkloadQuery(XC14N16,JCount,SC)
12/09 17:55:57 MWikiDoCommand(XC14N16,7321,9000000,CHECKSUM,CMD=GETJOBS ARG=0:ALL,Data,DataSize,SC)
12/09 17:55:57 MSUSendData(S,9000000,TRUE,FALSE)
12/09 17:55:57 INFO:     packet sent (77 bytes of 77)
12/09 17:55:57 INFO:     command sent to server
12/09 17:55:57 INFO:     message sent: 'CMD=GETJOBS ARG=0:ALL'
12/09 17:55:57 MSURecvData(S,9000000,1)
12/09 17:55:57 MSURecvPacket(8,Buffer,9,NULL,9000000)
12/09 17:55:57 MSURecvPacket(8,Buffer,200,NULL,9000000)
12/09 17:55:57 MSUDisconnect(S)
12/09 17:55:57 INFO:     received job list through WIKI RM
12/09 17:55:57 INFO:     loading 1 job(s)
12/09 17:55:57 MWikiUpdateJob(AList,18,0)
12/09 17:55:57 MUGetIndex(UPDATETIME=1102632406,ValList,0)
12/09 17:55:57 MUGetIndex(STATE=Idle,ValList,0)
12/09 17:55:57 MUGetIndex(WCLIMIT=1200,ValList,0)
12/09 17:55:57 MUGetIndex(TASKS=1,ValList,0)
12/09 17:55:57 MUGetIndex(QUEUETIME=1102632406,ValList,0)
12/09 17:55:57 MUGetIndex(UNAME=root,ValList,0)
12/09 17:55:57 MUGetIndex(GNAME=root,ValList,0)
12/09 17:55:57 MUGetIndex(PARTITIONMASK=lsf,ValList,0)
12/09 17:55:57 MUGetIndex(NODES=1,ValList,0)
12/09 17:55:57 MUGetIndex(RMEM=1,ValList,0)
12/09 17:55:57 MUGetIndex(RDISK=1,ValList,0)
12/09 17:55:57 INFO:     1 WIKI jobs detected on RM XC14N16
12/09 17:55:57 INFO:     jobs detected: 1
12/09 17:55:57 MStatClearUsage(node,Active)
12/09 17:55:57 MClusterUpdateNodeState()
12/09 17:55:57 MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg)
12/09 17:55:57 INFO:     job '18' Priority:        9
12/09 17:55:57 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:      0(00.0)  Serv:      9(00.0)
Targ:      0(00.0)  Res:      0(00.0)  Us:      0(00.0)
12/09 17:55:57 MStatClearUsage([NONE],Active)
12/09 17:55:57 INFO:     total jobs selected (ALL): 1/1
12/09 17:55:57 MQueueSelectAllJobs(Q,SOFT,ALL,JIList,DP,Msg)
12/09 17:55:57 INFO:     job '18' Priority:        9
12/09 17:55:57 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:      0(00.0)  Serv:      9(00.0)
Targ:      0(00.0)  Res:      0(00.0)  Us:      0(00.0)
12/09 17:55:57 MStatClearUsage([NONE],Idle)
12/09 17:55:57 INFO:     total jobs selected (ALL): 1/1
12/09 17:55:57 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,FALSE)
12/09 17:55:57 INFO:     total jobs selected in partition ALL: 1/1
12/09 17:55:57 MQueueScheduleRJobs(Q)
12/09 17:55:57 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
12/09 17:55:57 INFO:     total jobs selected in partition ALL: 1/1
12/09 17:55:57 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)
12/09 17:55:57 INFO:     total jobs selected in partition DEFAULT: 0/1 [PartitionAccess: 1]
12/09 17:55:57 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,TRUE)
12/09 17:55:57 INFO:     total jobs selected in partition ALL: 1/1
12/09 17:55:57 MQueueScheduleRJobs(Q)
12/09 17:55:57 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
12/09 17:55:57 INFO:     total jobs selected in partition ALL: 1/1
12/09 17:55:57 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)
12/09 17:55:57 INFO:     total jobs selected in partition DEFAULT: 0/1 [PartitionAccess: 1]
12/09 17:55:57 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,TRUE)
12/09 17:55:57 INFO:     total jobs selected in partition ALL: 1/1
12/09 17:55:57 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,DEFAULT,FReason,TRUE)
12/09 17:55:57 INFO:     total jobs selected in partition DEFAULT: 0/1 [PartitionAccess: 1]
12/09 17:55:57 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
12/09 17:55:57 INFO:     total jobs selected in partition ALL: 1/1
12/09 17:55:57 INFO:     job '18' Priority:        9
12/09 17:55:57 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:      0(00.0)  Serv:      9(00.0)
Targ:      0(00.0)  Res:      0(00.0)  Us:      0(00.0)
12/09 17:55:57 MSchedUpdateStats()
12/09 17:55:57 INFO:     iteration:  229   scheduling time:  0.002 seconds
12/09 17:55:57 MResUpdateStats()
12/09 17:55:57 INFO:     current util[229]:  0/4 (0.00%)  PH: 0.00%  active jobs: 0 of 2 (completed: 0)
12/09 17:55:57 MQueueCheckStatus()
12/09 17:55:57 MNodeCheckStatus()
12/09 17:55:57 MUClearChild(PID)
12/09 17:55:57 INFO:     scheduling complete.  sleeping 20 seconds
[root at xc14n16 log]#
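
To make a log excerpt like the one above quicker to scan, the following
Python sketch tallies the rejection reasons that follow each "total jobs
selected in partition" line. It is only a reading aid under stated
assumptions: the function name is invented, the regular expression is
inferred from the lines quoted here rather than from Maui's source, and only
a single bracketed reason group per line is handled.

```python
import re
from collections import Counter

# Pattern inferred from the quoted log lines; the bracketed part is optional.
SELECTED = re.compile(
    r"total jobs selected in partition (\S+): (\d+)/(\d+)(?:\s*\[([^\]]*)\])?"
)

def rejection_summary(log_text):
    """Count rejection reasons per partition in Maui log text."""
    # Mail wrapping can push the "[Reason: n]" suffix onto the next line,
    # so collapse all whitespace before matching.
    flat = " ".join(log_text.split())
    summary = Counter()
    for partition, _selected, _total, reasons in SELECTED.findall(flat):
        for reason, count in re.findall(r"(\w+):\s*(\d+)", reasons):
            summary[(partition, reason)] += int(count)
    return summary

# Lines taken from the log quoted in this message, including the wrap.
sample = """12/09 17:55:57 INFO:     total jobs selected in partition ALL: 1/1
12/09 17:55:57 INFO:     total jobs selected in partition DEFAULT: 0/1
[PartitionAccess: 1]
"""

for (partition, reason), count in rejection_summary(sample).items():
    print(partition, reason, count)
```

Run over a whole iteration, this shows at a glance that the job is selected
for partition ALL but rejected for partition DEFAULT with PartitionAccess.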

Thanks for any help,

Regards 

Susanne 

-------------------------------------------------------------------------------

For details about how I built MAUI and integrated it with SLURM see
section below.

I downloaded the MAUI kit: maui-3.2.6p9 from the MAUI website and
compiled MAUI from its source distribution. I tried to follow the steps
located at http://www.llnl.gov/linux/slurm/maui.html

The configuration step didn't ask me whether I wanted to build MAUI with
PBS, and it didn't ask me for a checksum seed either, although the SLURM
integration document says it should.

Reading further down in the SLURM integration instruction I noticed that
SLURM will be using the Wiki interface to MAUI.

From the doc it looks like my configure line should look something
like:

./configure --with-key=42 --with-wiki

Completed as expected

gmake

Completed as expected

Next I updated the MAUI configuration file, maui.cfg, with the following
info:

# Resource Manager Definition

RMCFG[XC14N16] TYPE=WIKI
RMPORT          7321
RMHOST          XC14N16
RMPOLLINTERVAL  00:00:20

In /hptc_cluster/slurm/etc/slurm.conf I uncommented the following lines:
SchedulerType=sched/wiki
SchedulerAuth=42
SchedulerPort=7321
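
As a quick consistency check on the two configuration fragments above, this
Python sketch parses both and compares the values that presumably have to
agree: the port (RMPORT vs. SchedulerPort), the key (the value given to
--with-key vs. SchedulerAuth), and the Wiki interface on both sides. The
parser is a simplification written for these fragments only, and the pairing
of settings is an assumption based on the steps described in this message.

```python
def parse_settings(text):
    """Parse 'KEY=VALUE' and 'KEY VALUE' lines into a dict, skipping comments."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        if "=" in line and " " not in line.split("=", 1)[0]:
            key, value = line.split("=", 1)   # slurm.conf style
        else:
            parts = line.split(None, 1)       # maui.cfg style
            if len(parts) != 2:
                continue
            key, value = parts
        settings[key.strip()] = value.strip()
    return settings

maui_cfg = """# Resource Manager Definition
RMCFG[XC14N16] TYPE=WIKI
RMPORT          7321
RMHOST          XC14N16
RMPOLLINTERVAL  00:00:20
"""
slurm_conf = """SchedulerType=sched/wiki
SchedulerAuth=42
SchedulerPort=7321
"""
configure_key = "42"  # value passed to ./configure --with-key

maui = parse_settings(maui_cfg)
slurm = parse_settings(slurm_conf)
checks = {
    "port": maui.get("RMPORT") == slurm.get("SchedulerPort"),
    "key": configure_key == slurm.get("SchedulerAuth"),
    "wiki": slurm.get("SchedulerType") == "sched/wiki"
            and maui.get("RMCFG[XC14N16]") == "TYPE=WIKI",
}
for name, ok in checks.items():
    print(name, "matches" if ok else "MISMATCH")
```

For the values quoted here all three checks pass, which fits the observation
that Maui and SLURM do exchange node and job information.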

I started maui and slurm, and some of the commands work.

[root at xc14n16 log]# showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME


     0 Active Jobs       0 of   10 Processors Active (0.00%)
                         0 of    4 Nodes Active      (0.00%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

18                     root       Idle     1    00:20:00  Thu Dec  9 17:46:46

1 Idle Job

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


Total Jobs: 1   Active Jobs: 0   Idle Jobs: 1   Blocked Jobs: 0
[root at xc14n16 log]#



-------------- next part --------------
A non-text attachment was scrubbed...
Name: maui.cfg
Type: application/octet-stream
Size: 2104 bytes
Desc: maui.cfg
Url : http://www.supercluster.org/pipermail/mauiusers/attachments/20041213/318ed5a5/maui.obj

