[torqueusers] $PBS_NODEFILE and exec_host problem
PN
poknam at gmail.com
Fri Nov 6 20:15:14 MST 2009
Hi,
I'm using maui-3.2.6p21-snap.1252608389 and torque-2.4.2.
I have 2 nodes, each with 4 cpus.
$ cat pbs.sh
#PBS -l nodes=2:ppn=4
#PBS -N hpl-8cpus
#PBS -j oe
cd /home/admin/hpl/hpl-2.0-openmpi
cat $PBS_NODEFILE
NP=`wc -l $PBS_NODEFILE | awk '{ print $1 }'`
cat $PBS_NODEFILE | awk '{ print $1"-clust" }' > ./machines
cd /home/admin/hpl/hpl-2.0-openmpi
/usr/mpi/gcc/openmpi-1.3.2/bin/mpirun -np $NP -machinefile ./machines \
    ./bin/core2-goto-openmpi/xhpl
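Because the script takes its process count from the number of lines in $PBS_NODEFILE, a short nodefile silently shrinks the MPI run instead of failing. The following is a minimal sketch of a guard that could go before the mpirun line. The function name check_nodefile and the hand-supplied expected count are assumptions made for illustration; they are not part of Torque.

```shell
# Hypothetical helper: fail if the nodefile has a different number of
# entries (one line per allocated slot) than the job expects.
check_nodefile() {
    nodefile=$1
    expected=$2
    # grep -c '' counts every line, including a last line without a newline.
    actual=$(grep -c '' "$nodefile")
    if [ "$actual" -ne "$expected" ]; then
        echo "nodefile has $actual entries, expected $expected" >&2
        return 1
    fi
    return 0
}

# Possible use inside the job script (8 = 2 nodes x 4 ppn):
#   check_nodefile "$PBS_NODEFILE" 8 || exit 1
```

With a guard like this the job would stop with a clear message in the joined output file when only 4 of the 8 requested slots show up.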
$ checkjob -v 30
checking job 30 (RM job '30.mgmt.v5cluster.com')
State: Running
Creds: user:admin group:admin class:batch qos:DEFAULT
WallTime: 00:00:00 of 1:00:00
SubmitTime: Sat Nov 7 11:00:24
(Time Queued Total: 00:00:01 Eligible: 00:00:01)
StartTime: Sat Nov 7 11:00:25
Total Tasks: 8
Req[0] TaskCount: 8 Partition: DEFAULT
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
Exec: '' ExecSize: 0 ImageSize: 0
Dedicated Resources Per Task: PROCS: 1
Utilized Resources Per Task: [NONE]
Avg Util Resources Per Task: [NONE]
Max Util Resources Per Task: [NONE]
NodeAccess: SHARED
TasksPerNode: 4 NodeCount: 2
Allocated Nodes:
[node0002:4][node0001:4]
Task Distribution:
node0002,node0002,node0002,node0002,node0001,node0001,node0001,node0001
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
Flags: RESTARTABLE
Reservation '30' (00:00:00 -> 1:00:00 Duration: 1:00:00)
PE: 8.00 StartPriority: 1
$ qstat -f
Job Id: 30.mgmt.v5cluster.com
Job_Name = hpl-8cpus
Job_Owner = admin at mgmt.v5cluster.com
job_state = R
queue = batch
server = mgmt.v5cluster.com
Checkpoint = u
ctime = Sat Nov 7 11:00:24 2009
Error_Path = mgmt.v5cluster.com:/home/admin/hpl/hpl-2.0-openmpi/hpl-8cpus.e30
exec_host = node0001/3+node0001/2+node0001/1+node0001/0
Hold_Types = n
Join_Path = oe
Keep_Files = n
Mail_Points = a
mtime = Sat Nov 7 11:00:25 2009
Output_Path = mgmt.v5cluster.com:/home/admin/hpl/hpl-2.0-openmpi/hpl-8cpus.o30
Priority = 0
qtime = Sat Nov 7 11:00:24 2009
Rerunable = True
Resource_List.nodect = 2
Resource_List.nodes = 2:ppn=4
Resource_List.walltime = 01:00:00
session_id = 16102
Variable_List = PBS_O_HOME=/home/admin,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=admin,
PBS_O_PATH=/usr/lib64/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/sbin:/usr/bin:/root/bin:/usr/sbin:/usr/bin,
PBS_O_MAIL=/var/spool/mail/root,PBS_O_SHELL=/bin/bash,
PBS_O_HOST=mgmt.v5cluster.com,PBS_SERVER=mgmt.v5cluster.com,
PBS_O_WORKDIR=/home/admin/hpl/hpl-2.0-openmpi,PBS_O_QUEUE=batch
etime = Sat Nov 7 11:00:24 2009
submit_args = pbs.sh
start_time = Sat Nov 7 11:00:25 2009
start_count = 1
fault_tolerant = False
Below is the maui.log.
11/07 11:11:29 INFO: connect request from 11.1.0.1
11/07 11:11:29 INFO: received service request from host 'mgmt.v5cluster.com'
11/07 11:11:29 MSURecvPacket(9,BufP,4,NULL,100000,SC)
11/07 11:11:31 ServerProcessRequests()
11/07 11:11:31 INFO: not rolling logs (5304 < 10000000)
11/07 11:11:31 MResAdjust(NULL,0,0)
11/07 11:11:31 MStatInitializeActiveSysUsage()
11/07 11:11:31 MStatClearUsage([NONE],Active)
11/07 11:11:31 ServerUpdate()
11/07 11:11:31 MSysUpdateTime()
11/07 11:11:31 INFO: starting iteration 77
11/07 11:11:31 MRMGetInfo()
11/07 11:11:31 MClusterClearUsage()
11/07 11:11:31 MRMClusterQuery()
11/07 11:11:31 MPBSClusterQuery(base,RCount,SC)
11/07 11:11:31 __MPBSGetNodeState(Name,State,PNode)
11/07 11:11:31 INFO: PBS node node0001 set to state Idle (free)
11/07 11:11:31 MPBSNodeUpdate(node0001,node0001,Idle,base)
11/07 11:11:31 MPBSLoadQueueInfo(base,node0001,SC)
11/07 11:11:31 INFO: queue 'batch' started state set to True
11/07 11:11:31 INFO: class to node not mapping enabled for queue 'batch' adding class to all nodes
11/07 11:11:31 __MPBSGetNodeState(Name,State,PNode)
11/07 11:11:31 INFO: PBS node node0002 set to state Idle (free)
11/07 11:11:31 MPBSNodeUpdate(node0002,node0002,Idle,base)
11/07 11:11:31 MPBSLoadQueueInfo(base,node0002,SC)
11/07 11:11:31 INFO: queue 'batch' started state set to True
11/07 11:11:31 INFO: class to node not mapping enabled for queue 'batch' adding class to all nodes
11/07 11:11:31 INFO: 2 PBS resources detected on RM base
11/07 11:11:31 INFO: resources detected: 2
11/07 11:11:31 MRMWorkloadQuery()
11/07 11:11:31 MPBSWorkloadQuery(base,JCount,SC)
11/07 11:11:31 MPBSJobLoad(31,31.mgmt.v5cluster.com,J,TaskList,0)
11/07 11:11:31 MReqCreate(31,SrcRQ,DstRQ,DoCreate)
11/07 11:11:31 INFO: processing node request line '2:ppn=4'
11/07 11:11:31 MJobSetCreds(31,admin,admin,)
11/07 11:11:31 INFO: default QOS for job 31 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
11/07 11:11:31 INFO: default QOS for job 31 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
11/07 11:11:31 INFO: default QOS for job 31 set to DEFAULT(0) (P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
11/07 11:11:31 INFO: job '31' loaded: 8 admin admin 3600 Idle 0 1257563489 [NONE] [NONE] [NONE] >= 0 >= 0 [NONE] 1257563491
11/07 11:11:31 INFO: 1 PBS jobs detected on RM base
11/07 11:11:31 INFO: jobs detected: 1
11/07 11:11:31 MStatClearUsage(node,Active)
11/07 11:11:31 MClusterUpdateNodeState()
11/07 11:11:31 MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg)
11/07 11:11:31 INFO: job '31' Priority: 1
11/07 11:11:31 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
11/07 11:11:31 MStatClearUsage([NONE],Active)
11/07 11:11:31 INFO: total jobs selected (ALL): 1/1
11/07 11:11:31 MQueueSelectAllJobs(Q,SOFT,ALL,JIList,DP,Msg)
11/07 11:11:31 INFO: job '31' Priority: 1
11/07 11:11:31 INFO: Cred: 0(00.0) FS: 0(00.0) Attr: 0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0) Us: 0(00.0)
11/07 11:11:31 MStatClearUsage([NONE],Idle)
11/07 11:11:31 INFO: total jobs selected (ALL): 1/1
11/07 11:11:31 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,FALSE)
11/07 11:11:31 INFO: total jobs selected in partition ALL: 1/1
11/07 11:11:31 MQueueScheduleRJobs(Q)
11/07 11:11:31 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
11/07 11:11:31 INFO: total jobs selected in partition ALL: 1/1
11/07 11:11:31 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)
11/07 11:11:31 INFO: total jobs selected in partition DEFAULT: 1/1
11/07 11:11:31 MQueueScheduleIJobs(Q,DEFAULT)
11/07 11:11:31 INFO: 8 feasible tasks found for job 31:0 in partition DEFAULT (8 Needed)
11/07 11:11:31 INFO: tasks located for job 31: 8 of 8 required (8 feasible)
11/07 11:11:31 MJobStart(31)
11/07 11:11:31 MJobDistributeTasks(31,base,NodeList,TaskMap)
11/07 11:11:31 MAMAllocJReserve(31,RIndex,ErrMsg)
11/07 11:11:31 MRMJobStart(31,Msg,SC)
11/07 11:11:31 MPBSJobStart(31,base,Msg,SC)
11/07 11:11:31 INFO: job '31' successfully started
11/07 11:11:31 MStatUpdateActiveJobUsage(31)
11/07 11:11:31 MResJCreate(31,MNodeList,00:00:00,ActiveJob,Res)
11/07 11:11:31 INFO: starting job '31'
11/07 11:11:31 INFO: 1 jobs started on iteration 77
Active Jobs------
------------------
11/07 11:11:31 INFO: resources available after scheduling: N: 0 P: 0
11/07 11:11:31 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,DEFAULT,FReason,TRUE)
11/07 11:11:31 INFO: total jobs selected in partition DEFAULT: 0/1 [State: 1]
11/07 11:11:31 MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,TRUE)
11/07 11:11:31 INFO: total jobs selected in partition ALL: 0/1 [State: 1]
11/07 11:11:31 MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
11/07 11:11:31 INFO: total jobs selected in partition ALL: 0/1 [State: 1]
11/07 11:11:31 MSchedUpdateStats()
11/07 11:11:31 INFO: iteration: 77 scheduling time: 0.008 seconds
11/07 11:11:31 MResUpdateStats()
11/07 11:11:31 INFO: current util[77]: 2/2 (100.00%) PH: 0.88% active jobs: 1 of 2 (completed: 29)
11/07 11:11:31 MQueueCheckStatus()
11/07 11:11:31 MNodeCheckStatus()
11/07 11:11:31 MUClearChild(PID)
11/07 11:11:31 INFO: scheduling complete. sleeping 30 seconds
The checkjob output shows the allocated nodes correctly (4 tasks on each of
node0002 and node0001), so Maui itself seems to be working.
However, exec_host and $PBS_NODEFILE contain only 4 CPUs, all on the same
node (node0001).
Is this a Torque problem?
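To make the mismatch easy to see, the exec_host string from qstat -f and the contents of $PBS_NODEFILE can each be reduced to a per-host slot count. This is only a rough sketch: the function names count_exec_host_slots and count_nodefile_slots are made up for illustration, and it assumes the exec_host format shown above (host/slot entries joined by "+") and a nodefile with one hostname per line.

```shell
# Hypothetical helper: count slots per host in an exec_host string
# of the form host/slot+host/slot+...
count_exec_host_slots() {
    printf '%s\n' "$1" | tr '+' '\n' | cut -d/ -f1 | sort | uniq -c |
        awk '{ print $2 ":" $1 }'
}

# Hypothetical helper: count entries per host in a nodefile
# (one hostname per line, one line per allocated slot).
count_nodefile_slots() {
    sort "$1" | uniq -c | awk '{ print $2 ":" $1 }'
}

# The exec_host value reported by qstat -f for job 30:
count_exec_host_slots 'node0001/3+node0001/2+node0001/1+node0001/0'
# prints: node0001:4
```

For a job that really received nodes=2:ppn=4, both helpers would be expected to print two lines, one per node with a count of 4. Here only node0001 appears, which matches what cat $PBS_NODEFILE shows inside the job.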
I've tried adding "JOBNODEMATCHPOLICY EXACTNODE" and
"ENABLEMULTIREQJOBS TRUE" to maui.cfg, but neither helped.
Does anyone know how to solve this? Any suggestions are appreciated.
Thanks.
--
Best Regards,
PN Lai