[Mauiusers] Can't get busy nodes

Gus Correa gus at ldeo.columbia.edu
Wed Sep 28 09:33:34 MDT 2011


Hi Fernando

Dennis already pointed out the first/main problem.
Your Torque/PBS script is not requesting a specific number of nodes
and cores/processors.
You can ask for 12 processors, even if your MPI command doesn't
use all of them:

#PBS -l nodes=1:ppn=12

[You can still do mpirun -np 8 if you want.]

This will prevent two jobs from running on the same node [which seems
to be your goal, if I understood it right].

I also like to add the queue name [even if it is the default]
and the job name [for documentation and for consistent
stdout/stderr naming]:

#PBS -q myqueue [whatever you called your queue]
#PBS -N myjob [15 characters at most, the rest gets truncated]

The #PBS directives must be grouped together, right after the #!/bin/sh line.
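Putting those pieces together, a minimal job script could look like
this [the queue and job names are placeholders; the vasp path is the
one from your script]:

```shell
#!/bin/sh
# Minimal Torque/PBS job script sketch.
# 'myqueue' and 'myjob' are placeholders; adjust to your site.
#PBS -q myqueue
#PBS -N myjob
#PBS -l nodes=1:ppn=12

# Run from the directory the job was submitted from.
cd $PBS_O_WORKDIR

# The full node (12 cores) is reserved, but you may still run fewer ranks:
mpirun -np 8 /usr/local/vasp/vasp
```

The point is that the resource request [ppn=12] and the actual MPI rank
count [-np 8] are independent: the first keeps other jobs off the node,
the second is up to the application.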

Ask your users to always add these lines to their jobs.
There is a Torque feature that lets you write a wrapper
that does whatever you want to the job script before it is queued,
but if your pool of users is small
you can just ask them to cooperate.
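That wrapper feature is Torque's submit filter: a script named by a
SUBMITFILTER line in torque.cfg, which gets the submitted job script on
stdin and must print the [possibly edited] script on stdout. A minimal
sketch that injects the whole-node request when the user forgot it
[the filter path and the function name are just for illustration]:

```shell
#!/bin/sh
# Sketch of a Torque submit filter. Torque feeds the job script on
# stdin and queues whatever the filter prints on stdout. Install it
# with a line like
#   SUBMITFILTER /var/spool/torque/submit_filter
# in torque.cfg [path is an assumption; adjust to your install].

filter_job() {
    awk '
        /^#PBS[ \t]+-l[ \t]+nodes=/ { have = 1 }  # user already asked for nodes
        { lines[NR] = $0 }
        END {
            print lines[1]                        # keep the #! line first
            if (!have) print "#PBS -l nodes=1:ppn=12"
            for (i = 2; i <= NR; i++) print lines[i]
        }
    '
}

# The installed filter script would simply end with:  filter_job
```

A job script with no -l nodes request comes out with
"#PBS -l nodes=1:ppn=12" inserted right after the shebang; scripts
that already request nodes pass through unchanged.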

Of course there is much more that you can add.
'man qsub' and 'man pbs_resources' are good sources of information,
highly recommended reading.


Then there is what Antonio Messina mentioned, the cpuset feature
of Torque.
I don't know if you installed Torque with this feature enabled.
However, if you did, it will allow specific cores to be
assigned to each process, which could allow node-sharing without
jobs stepping on each other's toes.
However:
A) this requires a bit more setup [not a lot, check the
list archives and the Torque Admin Guide]
B) if your users are cooperative and request 12 processors for each job,
and you're using Maui's 'JOBNODEMATCHPOLICY EXACTNODE', each job will
get a single node anyway.
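For reference, cpuset support in Torque is a compile-time option; if
you built from source you would enable it roughly like this [a sketch;
verify the configure flag name for your Torque version]:

```shell
# Assumption: building Torque from its source tree.
# --enable-cpuset is the relevant configure switch in the versions
# I have seen; double-check with ./configure --help for yours.
./configure --enable-cpuset
make
make install
```

If your existing installation was configured without it, you would need
to rebuild and reinstall on the compute nodes for cpusets to take effect.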

BTW, did you restart Maui after you added 'JOBNODEMATCHPOLICY EXACTNODE'
to the maui.cfg file?
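Maui only reads maui.cfg at startup, so a restart is needed after any
edit. The commands below are common defaults and may differ on your
system [init-script name and install path are assumptions]:

```shell
# Restart the Maui scheduler so it re-reads maui.cfg.
# Path/init-script names vary by install; adjust as needed.
/etc/init.d/maui restart

# Or, lacking an init script, stop the daemon and start it again:
#   killall maui
#   /usr/local/maui/sbin/maui
```

Afterwards, 'checkjob' / 'showq' should reflect the new
JOBNODEMATCHPOLICY behavior for newly submitted jobs.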

I hope this helps,
Gus Correa


Fernando Caba wrote:
> Hi Gus, my node file /var/spool/torque/server_priv/nodes looks like:
> 
> [root at fe server_priv]# more nodes
> n10 np=12
> n11 np=12
> n12 np=12
> n13 np=12
> [root at fe server_priv]#
> 
> it is exact as your comment.
> 
> My script:
> 
> #!/bin/bash
> 
> cd $PBS_O_WORKDIR
> 
> mpirun -np 8 /usr/local/vasp/vasp
> 
> launches 8 vasp processes on one node. If I start one more job (with
> -np 8), it will run on the same node (n13).
> So if I start another job with -np 8
> (or -np 4), it will also run on node n13.
> 
> I configured JOBNODEMATCHPOLICY EXACTNODE in maui.cfg, 
> but unfortunately the jobs ran on node n13.
> This is an example of the output of top
> 
> top - 00:05:53 up 14 days,  6:47,  1 user,  load average: 4.18, 4.06, 4.09
> Mem:  15955108k total, 13287888k used,  2667220k free,   142168k buffers
> Swap: 67111528k total,    16672k used, 67094856k free, 11360332k cached
> 
>    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 21796 patricia  25   0  463m 291m  12m R 100.5  1.9 517:29.59 vasp
> 21797 patricia  25   0  448m 276m  11m R 100.2  1.8 518:51.49 vasp
> 21798 patricia  25   0  458m 287m  11m R 100.2  1.8 522:01.79 vasp
> 21799 patricia  25   0  448m 276m  11m R 99.9  1.8 519:04.25 vasp
>      1 root      15   0 10348  672  568 S  0.0  0.0   0:00.53 init
>      2 root      RT  -5     0    0    0 S  0.0  0.0   0:00.06 migration/0
>      3 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0
>      4 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0
>      5 root      RT  -5     0    0    0 S  0.0  0.0   0:00.04 migration/1
> 
> The job script that generated those 4 vasp processes is:
> 
> #!/bin/bash
> 
> cd $PBS_O_WORKDIR
> 
> mpirun -np 4 /usr/local/vasp/vasp
> 
> Thanks
> 
> ----------------------------------------------------
> Ing. Fernando Caba
> Director General de Telecomunicaciones
> Universidad Nacional del Sur
> http://www.dgt.uns.edu.ar
> Tel/Fax: (54)-291-4595166
> Tel: (54)-291-4595101 int. 2050
> Avda. Alem 1253, (B8000CPB) Bahía Blanca - Argentina
> ----------------------------------------------------
> 
> 
> On 27/09/2011 08:07 PM, Gus Correa wrote:
>> Hi Fernando
>>
>> Did you try something like this in your
>> ${TORQUE}/server_priv/nodes file?
>>
>> frontend np=12 [skip this line if the frontend is not to do job work]
>> node1 np=12
>> node2 np=12
>> node3 np=12
>> node4 np=12
>>
>> This is probably the first thing to do.
>> It is not Maui, just plain Torque [actually pbs_server configuration].
>>
>> The lines above assume your nodes are called node1, ...
>> and the head node is called frontend,
>> in some name-resolvable manner [most likely
>> in your /etc/hosts file, most likely pointing to the nodes'
>> IP addresses in your cluster's private subnet, 192.168.X.X,
>> 10.X.X.X or equivalent].
>>
>> The 'np=12' clause will allow at most 12 *processes* per node.
>>
>>
>> [However, if VASP is *threaded*, say via OpenMP, then this won't
>> prevent several threads from being launched by each process.
>> To handle threaded codes you can use some tricks, such as requesting
>> more cores than processes.
>> Sorry, I am not familiar enough with VASP to say more than this.]
>>
>> I would suggest that you take a look at the Torque Admin Manual
>> for more details:
>> http://www.adaptivecomputing.com/resources/docs/torque/
>>
>> There are further controls in Maui, such as
>> 'JOBNODEMATCHPOLICY EXACTNODE' in maui.cfg,
>> for instance, if you want full nodes allocated to each job,
>> as opposed to jobs sharing cores in a single node.
>> However, these choices may come later.
>> [You can change maui.cfg and restart the maui scheduler to
>> test various changes.]
>>
>> For Maui details see the Maui Admin Guide:
>> http://www.adaptivecomputing.com/resources/docs/maui/index.php
>>
>> I hope this helps,
>> Gus Correa
>>
>> Fernando Caba wrote:
>>> Hi everybody, I am using torque 3.0.1 and maui 3.3.1 in a configuration
>>> composed of a front end and 4 nodes (2 processors, 6 cores each),
>>> totaling 48 cores.
>>> I need to configure each node to run no more than 12 processes
>>> (in particular we are using vasp), so we want no more than 12 vasp
>>> processes per node.
>>> How can I configure this? I'm getting confused reading so much
>>> information about torque and maui configuration.
>>>
>>> Thanks in advance.
>>>
>>>
>> _______________________________________________
>> mauiusers mailing list
>> mauiusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/mauiusers
>>
