Re: Re: [torqueusers] start intel mpi in pbs

Chaucer Cao ccao at sgi.com
Wed Jun 27 07:59:08 MDT 2007


Hi Donald,
I have already added the server name to the nodes file and started
pbs_mom on the server. Does it need any other special settings?
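
(A quick sanity check here, assuming the head node was added to
server_priv/nodes under the name "cluster", could be:

  pbsnodes -a          # every node, including the head node, should be listed
  pbsnodes cluster     # the head node should show state = free, not down

Adjust the node name if your nodes file uses a different one.)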
The problem is that when I run
 mpdboot --rsh=ssh -v -n `cat mpd.hosts|wc -l` -f mpd.hosts
it gives
totalnum=4 numhosts=3
Why are only 3 hosts available when there are four nodes?
If I use only two nodes (cluster and c0-0) and submit a two-node job, it
also fails and gives:
totalnum=2 numhosts=1
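
(One thing worth checking, assuming mpd.hosts lists cluster, c0-0, c0-1
and c0-2 as in the quoted output below: whether every name in the file is
reachable over ssh and resolves to the expected host, for example:

  while read h; do
      echo -n "$h -> "
      ssh -n "$h" hostname    # each entry should answer with a consistent name
  done < mpd.hosts

This is only a suggested check; the host names are taken from the messages
below.)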
Many thanks!
Chaucer

Danie wrote:
>
>Add the name of the server to the nodes file and start a pbs_mom on the
>server
>
>Chaucer Cao wrote:
>>
>> Hi Donald,
>>
>> How can I enable the admin node as a compute node in torque? Many
>> thanks!
>>
>> Best wishes,
>>
>> Chaucer
>>
>> ------------------------------------------------------------------------
>>
>> *From:* Donald Tripp [mailto:dtripp at hawaii.edu]
>> *Sent:* June 27, 2007 16:51
>> *To:* Chaucer Cao
>> *Cc:* torqueusers at supercluster.org; 'Krause, Roland'
>> *Subject:* Re: [torqueusers] start intel mpi in pbs
>>
>> It looks like torque is configured not to run jobs on the admin / main
>> node (in this case, "cluster"). That's why you get only 3 hosts
>> available: it will launch mpd on the c0-0, c0-1, and c0-2 nodes, but
>> not on cluster.
>>
>> By default, torque is set up not to allow jobs to run on the admin /
>> main node, but this can be enabled, and it will have to be in your case.
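>>
>> (Concretely, enabling this usually means listing the head node in the
>> server's nodes file and running a pbs_mom on it. A minimal sketch,
>> assuming the head node is named "cluster", each node has 4 CPUs, and
>> the TORQUE server home is /var/spool/torque:
>>
>>     # /var/spool/torque/server_priv/nodes
>>     cluster np=4
>>     c0-0    np=4
>>     c0-1    np=4
>>     c0-2    np=4
>>
>> then start pbs_mom on the head node and restart pbs_server so it
>> re-reads the nodes file. The path and np values here are assumptions;
>> adjust them to the actual installation.)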
>>
>> - Donald Tripp
>>
>> dtripp at hawaii.edu <mailto:dtripp at hawaii.edu>
>>
>> ----------------------------------------------
>>
>> HPC Systems Administrator
>>
>> High Performance Computing Center
>>
>> University of Hawai'i at Hilo
>>
>> 200 W. Kawili Street
>>
>> Hilo, Hawaii 96720
>>
>> http://www.hpc.uhh.hawaii.edu <http://www.hpc.uhh.hawaii.edu/>
>>
>>
>>
>> On Jun 26, 2007, at 10:38 PM, Chaucer Cao wrote:
>>
>>
>>
>> Hi all,
>>
>> In the PBS script file I can't start the mpd (Intel MPI) using the
>> following command:
>>
>>
>> ***************************************************************************
>>
>> mpdboot --rsh=ssh -v -n `cat mpd.hosts|wc -l` -f mpd.hosts
>>
>>
>> ***************************************************************************
>>
>> It gives:
>>
>>
>> ------------------------------------------------------------------------
>>
>> totalnum=4 numhosts=3
>>
>> there are not enough hosts on which to start all processes
>>
>>
>> ------------------------------------------------------------------------
>>
>> But I can manually start mpd using the same command.
>>
>>
>> ------------------------------------------------------------------------
>>
>> [mpp at cluster std]$ mpdboot --rsh=ssh -v -n 4 -f mpd.hosts
>>
>> running mpdallexit on cluster
>>
>> LAUNCHED mpd on cluster via
>>
>> RUNNING: mpd on cluster
>>
>> LAUNCHED mpd on c0-0 via cluster
>>
>> LAUNCHED mpd on c0-1 via cluster
>>
>> LAUNCHED mpd on c0-2 via cluster
>>
>> RUNNING: mpd on c0-0
>>
>> RUNNING: mpd on c0-1
>>
>> RUNNING: mpd on c0-2
>>
>>
>> ------------------------------------------------------------------------
>>
>> Does anyone know how to fix this? Many thanks!
>>
>> Best wishes,
>>
>> Chaucer
>>
>> ------------------------------------------------------------------------
>>
>> *From:* Chaucer Cao [mailto:ccao at sgi.com]
>> *Sent:* June 26, 2007 14:12
>> *To:* 'Krause, Roland'
>> *Subject:* Re: [torqueusers] how to get Environment Variables
>>
>> Hi Roland,
>>
>> Maybe the pbsnodes output gives the ntype info. You can see:
>>
>> c0-2
>>      state = free
>>      np = 4
>>      ntype = cluster
>>      status = opsys=linux,uname=Linux *compute-0-2.local* 2.6.9-42.0.2.ELsmp #1 SMP Wed Aug 23 13:38:27 BST 2006 x86_64,sessions=14316,nsessions=1,nusers=1,idletime=105210,totmem=5045676kb,availmem=4608468kb,physmem=4025560kb,ncpus=4,loadave=4.00,netload=483100398328,state=free,jobs=,varattr=,rectime=1182836318
>>
>> c0-1
>>      state = free
>>      np = 4
>>      ntype = cluster
>>      status = opsys=linux,uname=Linux *compute-0-1.local* 2.6.9-42.0.2.ELsmp #1 SMP Wed Aug 23 13:38:27 BST 2006 x86_64,sessions=26709,nsessions=1,nusers=1,idletime=234995,totmem=5045672kb,availmem=4592532kb,physmem=4025556kb,ncpus=4,loadave=4.00,netload=697953068235,state=free,jobs=,varattr=,rectime=1182836316
>>
>> c0-0
>>      state = free
>>      np = 4
>>      ntype = cluster
>>      status = opsys=linux,uname=Linux *compute-0-0.local* 2.6.9-42.0.2.ELsmp #1 SMP Wed Aug 23 13:38:27 BST 2006 x86_64,sessions=28348,nsessions=1,nusers=1,idletime=220618,totmem=5045676kb,availmem=4557852kb,physmem=4025560kb,ncpus=4,loadave=4.00,netload=588068945521,state=free,jobs=,varattr=,rectime=1182836318
>>
>> cluster
>>      state = free
>>      np = 4
>>      ntype = cluster
>>      status = opsys=linux,uname=Linux *cluster.hpc.org* 2.6.9-42.0.2.ELsmp #1 SMP Wed Aug 23 13:38:27 BST 2006 x86_64,sessions=2993 24894 25052 25158 25307,nsessions=5,nusers=3,idletime=92734,totmem=5045676kb,availmem=4130016kb,physmem=4025560kb,ncpus=4,loadave=4.48,netload=678702222035,state=free,jobs=,varattr=,rectime=1182836315
>>
>>
>> ------------------------------------------------------------------------
>>
>> It seems the head node gets a different domain. In /etc/hosts:
>>
>> #
>>
>> # Do NOT Edit (generated by dbreport)
>>
>> #
>>
>> 127.0.0.1 localhost.localdomain localhost
>>
>> 10.1.1.1 cluster.local cluster # originally frontend-0-0
>>
>> 10.255.255.254 compute-0-0.local compute-0-0 c0-0
>>
>> 10.255.255.253 compute-0-1.local compute-0-1 c0-1
>>
>> 10.255.255.252 compute-0-2.local compute-0-2 c0-2
>>
>> 192.168.1.1 cluster.hpc.org
>>
>> But I don't know how to tell pbs_server that it should use
>> cluster.local. Thanks!
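>>
>> (One possible approach, sketched only from the /etc/hosts shown above:
>> refer to the head node everywhere by a name that resolves to the
>> internal 10.1.1.1 address, e.g. put
>>
>>     cluster np=4
>>
>> in server_priv/nodes and list "cluster" (not cluster.hpc.org) in
>> mpd.hosts, since cluster.hpc.org maps to the external 192.168.1.1
>> interface. Whether this matches the intended network layout is an
>> assumption.)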
>>
>> Best wishes,
>>
>> Chaucer
>>
>> ------------------------------------------------------------------------
>>
>> *From:* Krause, Roland [mailto:Roland.Krause at amtc-dresden.com]
>> *Sent:* June 25, 2007 19:37
>> *To:* Chaucer Cao
>> *Subject:* RE: [torqueusers] how to get Environment Variables
>>
>> Hi Chaucer,
>>
>> Besides our production system we have a test system with two nodes. One
>> of them is the server, but I can run jobs with qsub -l nodes=2.
>>
>> Do all your nodes have the "ntype" "cluster"?
>>
>> Regards,
>>
>> Roland
>>
>>
>>     ------------------------------------------------------------------------
>>
>>     *From:* Chaucer Cao [mailto:ccao at sgi.com]
>>     *Sent:* Monday, June 25, 2007 10:33 AM
>>     *To:* Krause, Roland
>>     *Subject:* Re: [torqueusers] how to get Environment Variables
>>
>>     Hi Roland,
>>
>>     The environment variables problem is OK now, but I have encountered
>>     another problem:
>>
>>     There are four nodes including the head node, but I can only
>>     submit a 3-node job with qsub. When I submit a 4-node job it gives:
>>
>>     c0-0
>>
>>     c0-1
>>
>>     c0-2
>>
>>     cluster
>>
>>     totalnum=4 numhosts=3
>>
>>     there are not enough hosts on which to start all processes
>>
>>     mpdtrace: cannot connect to local mpd (/tmp/mpd2.console_ccao);
>>     possible causes:
>>
>>     1. no mpd is running on this host
>>
>>     2. an mpd is running but was started without a "console" (-n option)
>>
>>     mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_ccao);
>>     possible causes:
>>
>>     1. no mpd is running on this host
>>
>>     2. an mpd is running but was started without a "console" (-n option)
>>
>>     It seems I can't run the job on the head node (cluster) with PBS,
>>     but I can run a 4-node job directly (without qsub).
>>
>>     When I use pbsnodes to check, all the nodes seem to be in the free
>>     state. Can you help me with this? Many thanks!
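>>
>>     (One way to narrow this down, offered only as a sketch: start an
>>     interactive job that asks for all four nodes and look at what PBS
>>     actually allocated and whether an mpd ring is reachable from
>>     inside the job:
>>
>>         qsub -I -l nodes=4
>>         cat $PBS_NODEFILE    # should list cluster, c0-0, c0-1 and c0-2
>>         mpdtrace             # lists the mpd ring, or errors if none is up
>>
>>     The node names come from the output above; whether -l nodes=4
>>     maps to one slot per physical node depends on the scheduler
>>     configuration.)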
>>
>>     Best wishes,
>>
>>     Chaucer
>>
>>
>>     ------------------------------------------------------------------------
>>
>>     *From:* Krause, Roland [mailto:Roland.Krause at amtc-dresden.com]
>>     *Sent:* June 25, 2007 15:13
>>     *To:* Chaucer Cao
>>     *Subject:* RE: [torqueusers] how to get Environment Variables
>>
>>     Hi Chaucer,
>>
>>     Could you provide the part of your script which reads the PBS
>>     env variables?
>>
>>     Regards,
>>
>>     Roland
>>
>>
>>         ------------------------------------------------------------------------
>>
>>         *From:* torqueusers-bounces at supercluster.org
>>         [mailto:torqueusers-bounces at supercluster.org] *On Behalf Of
>>         *Chaucer Cao
>>         *Sent:* Wednesday, June 20, 2007 7:16 PM
>>         *To:* torqueusers at supercluster.org
>>         <mailto:torqueusers at supercluster.org>
>>         *Subject:* [torqueusers] how to get Environment Variables
>>
>>         Hi all,
>>
>>         Does anyone know how I can get the PBS environment
>>         variables in the job script file? When I qsub my script file
>>         it gives:
>>
>>         PBS_NODEFILE: Undefined variable.
>>
>>         PBS_ENVIRONMENT: Undefined variable.
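>>
>>         (For what it's worth: these variables are only defined inside
>>         a job that was actually started through qsub; they will not be
>>         set when the script is run by hand. A minimal sketch of a job
>>         script that prints them, assuming a bash shell:
>>
>>             #!/bin/bash
>>             #PBS -l nodes=4
>>             echo "PBS_ENVIRONMENT = $PBS_ENVIRONMENT"
>>             echo "nodes allocated to this job:"
>>             cat $PBS_NODEFILE
>>
>>         The nodes=4 request is just an example value.)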
>>
>>         Many thanks!
>>
>>         Chaucer
>>
>> _______________________________________________
>>
>> torqueusers mailing list
>>
>> torqueusers at supercluster.org <mailto:torqueusers at supercluster.org>
>>
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>
>_______________________________________________
>torqueusers mailing list
>torqueusers at supercluster.org
>http://www.supercluster.org/mailman/listinfo/torqueusers



