[torqueusers] start intel mpi in pbs

Donald Tripp dtripp at hawaii.edu
Wed Jun 27 02:50:53 MDT 2007


It looks like torque is configured not to run jobs on the admin /
main node (in this case, "cluster"). That's why you get only 3 hosts
available: it will launch mpd on the c0-0, c0-1, and c0-2
nodes, but not on cluster.
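
To confirm which hosts the job was actually given, something like the
following could go near the top of the PBS script. It is only a sketch:
it assumes the job runs under torque (so $PBS_NODEFILE is set) and it
reuses the mpdboot options from your message. The function names
unique_hosts and count_hosts are made up for illustration.

```shell
#!/bin/sh
# Sketch only: count the distinct hosts torque assigned to this job.
# $PBS_NODEFILE is set by torque inside a job and has one line per
# allocated processor slot, so a host with np=4 can appear several times.

# Print the unique host names found in the given node file (blank lines skipped).
unique_hosts() {
    awk 'NF { print $1 }' "$1" | sort -u
}

# Print the number of unique hosts in the given node file.
count_hosts() {
    unique_hosts "$1" | wc -l | tr -d ' '
}

# Only act when running inside a job (PBS_NODEFILE set and readable).
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    unique_hosts "$PBS_NODEFILE" > mpd.hosts
    nhosts=$(count_hosts "$PBS_NODEFILE")
    echo "hosts assigned to this job: $nhosts"
    cat mpd.hosts
    # Same mpdboot options as in the original message, but with the
    # host count taken from what the job was actually given.
    if command -v mpdboot >/dev/null 2>&1; then
        mpdboot --rsh=ssh -v -n "$nhosts" -f mpd.hosts
    fi
fi
```

If "cluster" does not show up in the printed list, the job never got the
head node, and mpdboot cannot be expected to start a fourth mpd there.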

By default, torque is set up not to allow jobs to run on the admin /
main node, but this can be enabled, and in your case it will have to be.
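
As a first check, you could look at whether the head node has an entry
in the server's nodes file at all. The sketch below assumes the file
lives at /var/spool/torque/server_priv/nodes, which depends on how
torque was installed, and that entries look like "c0-0 np=4". The
function name node_listed and the NODES_FILE / HEAD_NODE variables are
made up for illustration.

```shell
#!/bin/sh
# Sketch only: check whether a host has an entry in a torque nodes file.
# Entries are assumed to look like "c0-0 np=4": host name first,
# attributes after.

# Succeeds if host $2 is the first field of some line in file $1.
node_listed() {
    awk -v h="$2" '$1 == h { found = 1 } END { exit !found }' "$1"
}

# ASSUMPTION: path of the nodes file; adjust to your installation.
NODES_FILE="${NODES_FILE:-/var/spool/torque/server_priv/nodes}"
# Head node name as used in this thread.
HEAD_NODE="${HEAD_NODE:-cluster}"

if [ -r "$NODES_FILE" ]; then
    if node_listed "$NODES_FILE" "$HEAD_NODE"; then
        echo "$HEAD_NODE is listed in $NODES_FILE"
    else
        echo "$HEAD_NODE is NOT listed in $NODES_FILE"
    fi
fi
```

Being listed is not enough on its own: a pbs_mom also has to be running
on the head node before the server can place jobs there.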


- Donald Tripp
   dtripp at hawaii.edu
----------------------------------------------
HPC Systems Administrator
High Performance Computing Center
University of Hawai'i at Hilo
200 W. Kawili Street
Hilo,   Hawaii   96720
http://www.hpc.uhh.hawaii.edu


On Jun 26, 2007, at 10:38 PM, Chaucer Cao wrote:

> Hi all,
>
> In the pbs script file I can't start mpd (Intel MPI) using
> the following command:
>
> ****************************************************************************
>
> mpdboot  --rsh=ssh -v -n `cat mpd.hosts|wc -l`  -f mpd.hosts
>
> ****************************************************************************
>
> It gives:
>
> ------------------------------------------------------------------------
>
> totalnum=4  numhosts=3
>
> there are not enough hosts on which to start all processes
>
> ------------------------------------------------------------------------
>
> But I can manually start mpd using the same command.
>
> ------------------------------------------------------------------------
>
> [mpp at cluster std]$  mpdboot --rsh=ssh -v -n 4 -f mpd.hosts
>
> running mpdallexit on cluster
>
> LAUNCHED mpd on cluster  via
>
> RUNNING: mpd on cluster
>
> LAUNCHED mpd on c0-0  via  cluster
>
> LAUNCHED mpd on c0-1  via  cluster
>
> LAUNCHED mpd on c0-2  via  cluster
>
> RUNNING: mpd on c0-0
>
> RUNNING: mpd on c0-1
>
> RUNNING: mpd on c0-2
>
> ------------------------------------------------------------------------
>
>
>
> Does anyone know how to fix this? Many thanks!
>
> Best wishes,
>
> Chaucer
>
>
>
> From: Chaucer Cao [mailto:ccao at sgi.com]
> Sent: June 26, 2007 14:12
> To: 'Krause, Roland'
> Subject: Re: [torqueusers] how to get Environment Variables
>
>
>
> Hi Roland,
>
> Maybe the pbsnodes output gives the ntype info you asked about:
>
> c0-2
>
>      state = free
>
>      np = 4
>
>      ntype = cluster
>
>      status = opsys=linux,uname=Linux compute-0-2.local 2.6.9-42.0.2.ELsmp #1 SMP Wed Aug 23 13:38:27 BST 2006 x86_64,sessions=14316,nsessions=1,nusers=1,idletime=105210,totmem=5045676kb,availmem=4608468kb,physmem=4025560kb,ncpus=4,loadave=4.00,netload=483100398328,state=free,jobs=,varattr=,rectime=1182836318
>
>
>
> c0-1
>
>      state = free
>
>      np = 4
>
>      ntype = cluster
>
>      status = opsys=linux,uname=Linux compute-0-1.local 2.6.9-42.0.2.ELsmp #1 SMP Wed Aug 23 13:38:27 BST 2006 x86_64,sessions=26709,nsessions=1,nusers=1,idletime=234995,totmem=5045672kb,availmem=4592532kb,physmem=4025556kb,ncpus=4,loadave=4.00,netload=697953068235,state=free,jobs=,varattr=,rectime=1182836316
>
>
>
> c0-0
>
>      state = free
>
>      np = 4
>
>      ntype = cluster
>
>      status = opsys=linux,uname=Linux compute-0-0.local 2.6.9-42.0.2.ELsmp #1 SMP Wed Aug 23 13:38:27 BST 2006 x86_64,sessions=28348,nsessions=1,nusers=1,idletime=220618,totmem=5045676kb,availmem=4557852kb,physmem=4025560kb,ncpus=4,loadave=4.00,netload=588068945521,state=free,jobs=,varattr=,rectime=1182836318
>
>
>
> cluster
>
>      state = free
>
>      np = 4
>
>      ntype = cluster
>
>      status = opsys=linux,uname=Linux cluster.hpc.org 2.6.9-42.0.2.ELsmp #1 SMP Wed Aug 23 13:38:27 BST 2006 x86_64,sessions=2993 24894 25052 25158 25307,nsessions=5,nusers=3,idletime=92734,totmem=5045676kb,availmem=4130016kb,physmem=4025560kb,ncpus=4,loadave=4.48,netload=678702222035,state=free,jobs=,varattr=,rectime=1182836315
>
> ------------------------------------------------------------------------
>
> It seems the head node gets a different domain. In /etc/hosts:
>
> #
>
> # Do NOT Edit (generated by dbreport)
>
> #
>
> 127.0.0.1       localhost.localdomain   localhost
>
> 10.1.1.1        cluster.local cluster # originally frontend-0-0
>
> 10.255.255.254  compute-0-0.local compute-0-0 c0-0
>
> 10.255.255.253  compute-0-1.local compute-0-1 c0-1
>
> 10.255.255.252  compute-0-2.local compute-0-2 c0-2
>
> 192.168.1.1     cluster.hpc.org
>
> But I don't know how to tell the pbs_server that it should use
> cluster.local. :-) Thanks!
>
> Best wishes,
>
> Chaucer
>
>
>
>
>
> From: Krause, Roland [mailto:Roland.Krause at amtc-dresden.com]
> Sent: June 25, 2007 19:37
> To: Chaucer Cao
> Subject: RE: [torqueusers] how to get Environment Variables
>
>
>
> Hi Chaucer,
>
>
>
> Besides our production system we have a test system with two nodes.
> One of them is the server, but I can still run jobs with
> qsub -l nodes=2.
>
> Do all your nodes have the "ntype" "cluster"?
>
>
>
> Regards,
>
> Roland
>
>
>
> From: Chaucer Cao [mailto:ccao at sgi.com]
> Sent: Monday, June 25, 2007 10:33 AM
> To: Krause, Roland
> Subject: Re: [torqueusers] how to get Environment Variables
>
> Hi Roland,
>
> The environment variables problem is OK now, but I have run into
> another problem:
>
> There are four nodes including the head node, but I can only submit
> a 3-node job with qsub. When I submit a 4-node job it gives:
>
> c0-0
>
> c0-1
>
> c0-2
>
> cluster
>
> totalnum=4  numhosts=3
>
> there are not enough hosts on which to start all processes
>
>   1. no mpd is running on this host
>
>   2. an mpd is running but was started without a "console" (-n option)
>
> mpdtrace: cannot connect to local mpd (/tmp/mpd2.console_ccao);  
> possible causes:
>
> mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_ccao);  
> possible causes:
>
>   1. no mpd is running on this host
>
>   2. an mpd is running but was started without a "console" (-n option)
>
> It seems I can't run the job on the head node (cluster) with pbs. But I
> can run a 4-node job directly (without qsub).
>
> When I check with pbsnodes, all nodes seem to be in the free state.
> Can you help me with this? Many thanks!
>
> Best wishes,
>
> Chaucer
>
>
>
>
>
> From: Krause, Roland [mailto:Roland.Krause at amtc-dresden.com]
> Sent: June 25, 2007 15:13
> To: Chaucer Cao
> Subject: RE: [torqueusers] how to get Environment Variables
>
>
>
> Hi Chaucer,
>
>
>
> Could you provide the part of your script that reads the PBS
> env variables?
>
>
>
> Regards,
>
> Roland
>
>
>
> From: torqueusers-bounces at supercluster.org
> [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Chaucer Cao
> Sent: Wednesday, June 20, 2007 7:16 PM
> To: torqueusers at supercluster.org
> Subject: [torqueusers] how to get Environment Variables
>
> Hi all,
>
> Does anyone know how I can get the PBS environment variables
> in the run script file? When I qsub my script file it gives:
>
> PBS_NODEFILE: Undefined variable.
>
> PBS_ENVIRONMENT: Undefined variable.
>
> Many thanks!
>
> Chaucer
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
