[torqueusers] $PBS_NODEFILE and exec_host problem

PN poknam at gmail.com
Fri Nov 6 20:15:14 MST 2009


Hi,

I'm using maui-3.2.6p21-snap.1252608389 and torque-2.4.2.
I have 2 nodes, each with 4 CPUs.

$ cat pbs.sh
#PBS -l nodes=2:ppn=4
#PBS -N hpl-8cpus
#PBS -j oe

cd /home/admin/hpl/hpl-2.0-openmpi

# Show the node list Torque hands to the job
cat $PBS_NODEFILE

# NP = number of lines in $PBS_NODEFILE (one per allocated processor)
NP=`wc -l $PBS_NODEFILE | awk '{ print $1 }'`

# Build the Open MPI machinefile, appending "-clust" to each hostname
cat $PBS_NODEFILE | awk '{ print $1"-clust" }' > ./machines

/usr/mpi/gcc/openmpi-1.3.2/bin/mpirun -np $NP -machinefile ./machines \
    ./bin/core2-goto-openmpi/xhpl
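
For reference, with nodes=2:ppn=4 I would expect $PBS_NODEFILE to contain one
line per allocated processor, something like this (ordering may differ):

node0001
node0001
node0001
node0001
node0002
node0002
node0002
node0002

In my runs, however, it only lists one node's four slots, matching the
exec_host shown further below.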

$ checkjob -v 30


checking job 30 (RM job '30.mgmt.v5cluster.com')

State: Running
Creds:  user:admin  group:admin  class:batch  qos:DEFAULT
WallTime: 00:00:00 of 1:00:00
SubmitTime: Sat Nov  7 11:00:24
  (Time Queued  Total: 00:00:01  Eligible: 00:00:01)

StartTime: Sat Nov  7 11:00:25
Total Tasks: 8

Req[0]  TaskCount: 8  Partition: DEFAULT
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Exec:  ''  ExecSize: 0  ImageSize: 0
Dedicated Resources Per Task: PROCS: 1
Utilized Resources Per Task:  [NONE]
Avg Util Resources Per Task:  [NONE]
Max Util Resources Per Task:  [NONE]
NodeAccess: SHARED
TasksPerNode: 4  NodeCount: 2
Allocated Nodes:
[node0002:4][node0001:4]
Task Distribution:
node0002,node0002,node0002,node0002,node0001,node0001,node0001,node0001


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags:       RESTARTABLE

Reservation '30' (00:00:00 -> 1:00:00  Duration: 1:00:00)
PE:  8.00  StartPriority:  1


$ qstat -f
Job Id: 30.mgmt.v5cluster.com
    Job_Name = hpl-8cpus
    Job_Owner = admin at mgmt.v5cluster.com
    job_state = R
    queue = batch
    server = mgmt.v5cluster.com
    Checkpoint = u
    ctime = Sat Nov  7 11:00:24 2009
    Error_Path = mgmt.v5cluster.com:/home/admin/hpl/hpl-2.0-openmpi/hpl-8cpus.e30
    exec_host = node0001/3+node0001/2+node0001/1+node0001/0
    Hold_Types = n
    Join_Path = oe
    Keep_Files = n
    Mail_Points = a
    mtime = Sat Nov  7 11:00:25 2009
    Output_Path = mgmt.v5cluster.com:/home/admin/hpl/hpl-2.0-openmpi/hpl-8cpus.o30
    Priority = 0
    qtime = Sat Nov  7 11:00:24 2009
    Rerunable = True
    Resource_List.nodect = 2
    Resource_List.nodes = 2:ppn=4
    Resource_List.walltime = 01:00:00
    session_id = 16102
    Variable_List = PBS_O_HOME=/home/admin,PBS_O_LANG=en_US.UTF-8,
        PBS_O_LOGNAME=admin,
        PBS_O_PATH=/usr/lib64/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/sbin:/usr/bin:/root/bin:/usr/sbin:/usr/bin,
        PBS_O_MAIL=/var/spool/mail/root,PBS_O_SHELL=/bin/bash,
        PBS_O_HOST=mgmt.v5cluster.com,PBS_SERVER=mgmt.v5cluster.com,
        PBS_O_WORKDIR=/home/admin/hpl/hpl-2.0-openmpi,PBS_O_QUEUE=batch
    etime = Sat Nov  7 11:00:24 2009
    submit_args = pbs.sh
    start_time = Sat Nov  7 11:00:25 2009
    start_count = 1
    fault_tolerant = False


Below is the maui.log.

11/07 11:11:29 INFO:     connect request from 11.1.0.1
11/07 11:11:29 INFO:     received service request from host 'mgmt.v5cluster.com'
11/07 11:11:29 MSURecvPacket(9,BufP,4,NULL,100000,SC)
11/07 11:11:31 ServerProcessRequests()
11/07 11:11:31 INFO:     not rolling logs (5304 < 10000000)
11/07 11:11:31 MResAdjust(NULL,0,0)
11/07 11:11:31 MStatInitializeActiveSysUsage()
11/07 11:11:31 MStatClearUsage([NONE],Active)
11/07 11:11:31 ServerUpdate()
11/07 11:11:31 MSysUpdateTime()
11/07 11:11:31 INFO:     starting iteration 77
11/07 11:11:31 MRMGetInfo()
11/07 11:11:31 MClusterClearUsage()
11/07 11:11:31 MRMClusterQuery()
11/07 11:11:31 MPBSClusterQuery(base,RCount,SC)
11/07 11:11:31 __MPBSGetNodeState(Name,State,PNode)
11/07 11:11:31 INFO:     PBS node node0001 set to state Idle (free)
11/07 11:11:31 MPBSNodeUpdate(node0001,node0001,Idle,base)
11/07 11:11:31 MPBSLoadQueueInfo(base,node0001,SC)
11/07 11:11:31 INFO:     queue 'batch' started state set to True
11/07 11:11:31 INFO:     class to node not mapping enabled for queue 'batch' adding class to all nodes
11/07 11:11:31 __MPBSGetNodeState(Name,State,PNode)
11/07 11:11:31 INFO:     PBS node node0002 set to state Idle (free)
11/07 11:11:31 MPBSNodeUpdate(node0002,node0002,Idle,base)
11/07 11:11:31 MPBSLoadQueueInfo(base,node0002,SC)
11/07 11:11:31 INFO:     queue 'batch' started state set to True
11/07 11:11:31 INFO:     class to node not mapping enabled for queue 'batch' adding class to all nodes
11/07 11:11:31 INFO:     2 PBS resources detected on RM base
11/07 11:11:31 INFO:     resources detected: 2
11/07 11:11:31 MRMWorkloadQuery()
11/07 11:11:31 MPBSWorkloadQuery(base,JCount,SC)
11/07 11:11:31 MPBSJobLoad(31,31.mgmt.v5cluster.com,J,TaskList,0)
11/07 11:11:31 MReqCreate(31,SrcRQ,DstRQ,DoCreate)
11/07 11:11:31 INFO:     processing node request line '2:ppn=4'
11/07 11:11:31 MJobSetCreds(31,admin,admin,)
11/07 11:11:31 INFO:     default QOS for job 31 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
11/07 11:11:31 INFO:     default QOS for job 31 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
11/07 11:11:31 INFO:     default QOS for job 31 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
11/07 11:11:31 INFO:     job '31' loaded:   8    admin    admin   3600 Idle   0 1257563489   [NONE] [NONE] [NONE] >=      0 >=      0 [NONE] 1257563491
11/07 11:11:31 INFO:     1 PBS jobs detected on RM base
11/07 11:11:31 INFO:     jobs detected: 1
11/07 11:11:31 MStatClearUsage(node,Active)
11/07 11:11:31 MClusterUpdateNodeState()
11/07 11:11:31 MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg)
11/07 11:11:31 INFO:     job '31' Priority:        1
11/07 11:11:31 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:      0(00.0)  Serv:      0(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:      0(00.0)
11/07 11:11:31 MStatClearUsage([NONE],Active)
11/07 11:11:31 INFO:     total jobs selected (ALL): 1/1
11/07 11:11:31 MQueueSelectAllJobs(Q,SOFT,ALL,JIList,DP,Msg)
11/07 11:11:31 INFO:     job '31' Priority:        1
11/07 11:11:31 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:      0(00.0)  Serv:      0(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:      0(00.0)
11/07 11:11:31 MStatClearUsage([NONE],Idle)
11/07 11:11:31 INFO:     total jobs selected (ALL): 1/1
11/07 11:11:31 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,FALSE)
11/07 11:11:31 INFO:     total jobs selected in partition ALL: 1/1
11/07 11:11:31 MQueueScheduleRJobs(Q)
11/07 11:11:31 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
11/07 11:11:31 INFO:     total jobs selected in partition ALL: 1/1
11/07 11:11:31 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)
11/07 11:11:31 INFO:     total jobs selected in partition DEFAULT: 1/1
11/07 11:11:31 MQueueScheduleIJobs(Q,DEFAULT)
11/07 11:11:31 INFO:     8 feasible tasks found for job 31:0 in partition DEFAULT (8 Needed)
11/07 11:11:31 INFO:     tasks located for job 31:  8 of 8 required (8 feasible)
11/07 11:11:31 MJobStart(31)
11/07 11:11:31 MJobDistributeTasks(31,base,NodeList,TaskMap)
11/07 11:11:31 MAMAllocJReserve(31,RIndex,ErrMsg)
11/07 11:11:31 MRMJobStart(31,Msg,SC)
11/07 11:11:31 MPBSJobStart(31,base,Msg,SC)
11/07 11:11:31 INFO:     job '31' successfully started
11/07 11:11:31 MStatUpdateActiveJobUsage(31)
11/07 11:11:31 MResJCreate(31,MNodeList,00:00:00,ActiveJob,Res)
11/07 11:11:31 INFO:     starting job '31'
11/07 11:11:31 INFO:     1 jobs started on iteration 77
Active Jobs------
------------------
11/07 11:11:31 INFO:     resources available after scheduling: N: 0  P: 0
11/07 11:11:31 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)
11/07 11:11:31 INFO:     total jobs selected in partition DEFAULT: 0/1 [State: 1]
11/07 11:11:31 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,TRUE)
11/07 11:11:31 INFO:     total jobs selected in partition ALL: 0/1 [State: 1]
11/07 11:11:31 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
11/07 11:11:31 INFO:     total jobs selected in partition ALL: 0/1 [State: 1]
11/07 11:11:31 MSchedUpdateStats()
11/07 11:11:31 INFO:     iteration:   77   scheduling time:  0.008 seconds
11/07 11:11:31 MResUpdateStats()
11/07 11:11:31 INFO:     current util[77]:  2/2 (100.00%)  PH: 0.88%  active jobs: 1 of 2 (completed: 29)
11/07 11:11:31 MQueueCheckStatus()
11/07 11:11:31 MNodeCheckStatus()
11/07 11:11:31 MUClearChild(PID)
11/07 11:11:31 INFO:     scheduling complete.  sleeping 30 seconds




The checkjob output above shows the allocated nodes correctly (4 tasks each on
node0002 and node0001), so Maui appears to be scheduling the job as expected.
However, exec_host and $PBS_NODEFILE list only 4 CPU slots, all on the same
node (node0001).
Is this a Torque problem?
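
For comparison, if Torque had used both nodes I would expect exec_host to look
something like this (slot order may differ):

exec_host = node0002/0+node0002/1+node0002/2+node0002/3+node0001/0+node0001/1+node0001/2+node0001/3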

I've tried adding "JOBNODEMATCHPOLICY EXACTNODE" and "ENABLEMULTIREQJOBS TRUE"
to maui.cfg, but it didn't help.

Does anyone know how to solve this? Any suggestions are appreciated.

Thanks.



-- 
Best Regards,
PN Lai