[Mauiusers] Can't get busy nodes

Gus Correa gus at ldeo.columbia.edu
Wed Sep 28 13:07:44 MDT 2011


Hi Fernando

Did you restart maui after you changed maui.cfg? [service maui restart]

Any chance that what you see is still residual from old jobs,
submitted before you changed the maui configuration and the job scripts
[#PBS -l nodes=1:ppn=12]?
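
If you want to check for leftovers, the following should show any jobs
that are still queued or running [assuming the Torque and Maui binaries
are in your PATH; the job id 123 is only a placeholder]:

qstat -a       # Torque's view of all jobs
showq          # Maui's view of the queue
qdel 123       # remove a stale job by its Torque job id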

For more help from everybody on the list,
it may be useful if you send the output of:

qmgr -c 'p s'

${TORQUE}/bin/pbsnodes

${MAUI}/bin/showconfig

ps -ef | grep maui

service maui status
service pbs_server status
service pbs_sched status [just in case it is also running ...]
service pbs_mom status
service pbs status
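
Two Maui diagnostics may also help, to see how Maui itself views the
nodes and the queue [paths assume a default Maui install; n13 is the
node your jobs keep landing on]:

${MAUI}/bin/checknode n13
${MAUI}/bin/diagnose -n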

I hope this helps,
Gus Correa


Fernando Caba wrote:
> Hi everybody, thanks for all the answers.
> I tried everything that you pointed out:
> 
> including
> #PBS -l nodes=1:ppn=12
> 
> adding
> 
> JOBNODEMATCHPOLICY EXACTNODE
> 
> to maui.cfg
> 
> but none of this works. I'm thinking that the problem is in another 
> config parameter (maui or torque).
> 
> I will keep reading about all of this.
> 
> Thanks!!
> 
> ----------------------------------------------------
> Ing. Fernando Caba
> Director General de Telecomunicaciones
> Universidad Nacional del Sur
> http://www.dgt.uns.edu.ar
> Tel/Fax: (54)-291-4595166
> Tel: (54)-291-4595101 int. 2050
> Avda. Alem 1253, (B8000CPB) Bahía Blanca - Argentina
> ----------------------------------------------------
> 
> 
> On 28/09/2011 12:33 PM, Gus Correa wrote:
>> Hi Fernando
>>
>> Dennis already pointed out the first/main problem.
>> Your Torque/PBS script is not requesting a specific number of nodes
>> and cores/processors.
>> You can ask for 12 processors, even if your MPI command doesn't
>> use all of them:
>>
>> #PBS -l nodes=1:ppn=12
>>
>> [You can still do mpirun -np 8 if you want.]
>>
>> This will prevent two jobs from running on the same node [which seems
>> to be your goal, if I understood it right].
>>
>> I also like to add the queue name [even if it is the default]
>> and the job name [for documentation and for stdout/stderr
>> naming consistency]:
>>
>> #PBS -q myqueue [whatever you called your queue]
>> #PBS -N myjob [15 characters at most, the rest gets truncated]
>>
>> The #PBS clauses must be together and right after the #! /bin/sh line.
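>>
>> Putting those pieces together, a minimal job script could look like
>> this [the queue name, job name, and walltime are placeholders;
>> adjust the mpirun line to your own case]:
>>
>> #!/bin/bash
>> #PBS -q myqueue
>> #PBS -N myjob
>> #PBS -l nodes=1:ppn=12
>> #PBS -l walltime=24:00:00
>>
>> cd $PBS_O_WORKDIR
>> mpirun -np 8 /usr/local/vasp/vasp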
>>
>> Ask your users to always add these lines to their jobs.
>> There is a feature of Torque [the qsub submit filter] that allows you
>> to write a wrapper that adds whatever you want to the job script
>> [see the sketch below], but if your pool of users is small
>> you can just ask them to cooperate.
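>>
>> A rough sketch of such a filter, assuming the standard Torque submit
>> filter mechanism [qsub pipes each job script through the filter and
>> submits the filter's output; the filter path is set with SUBMITFILTER
>> in torque.cfg]:
>>
>> #!/bin/bash
>> # Read the user's job script from stdin, inject a full-node request
>> # right after the shebang line unless the user already requested
>> # nodes, and write the result to stdout for qsub to submit.
>> script=$(cat)
>> printf '%s\n' "$script" | head -n 1
>> if ! printf '%s\n' "$script" | grep -q '^#PBS.*-l.*nodes='; then
>>     echo "#PBS -l nodes=1:ppn=12"
>> fi
>> printf '%s\n' "$script" | tail -n +2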
>>
>> Of course there is much more that you can add.
>> 'man qsub' and 'man pbs_resources' are good sources of information,
>> highly recommended reading.
>>
>>
>> Then there is what Antonio Messina mentioned, the cpuset feature
>> of Torque.
>> I don't know if you installed Torque with this feature enabled.
>> However, if you did, it will allow specific cores to be
>> assigned to each process, which could allow node-sharing without
>> jobs stepping on each other's toes.
>> However:
>> A) this requires a bit more setup [not a lot, check the
>> list archives and the Torque Admin Guide]
>> B) if your users are cooperative and request 12 processors for each job,
>> and you're using the Maui 'JOBNODEMATCHPOLICY EXACTNODE' setting,
>> each job will get a whole node anyway.
>>
>> BTW, did you restart Maui after you added 'JOBNODEMATCHPOLICY EXACTNODE'
>> to the maui.cfg file?
>>
>> I hope this helps,
>> Gus Correa
>>
>>
>> Fernando Caba wrote:
>>> Hi Gus, my node file /var/spool/torque/server_priv/nodes looks like:
>>>
>>> [root at fe server_priv]# more nodes
>>> n10 np=12
>>> n11 np=12
>>> n12 np=12
>>> n13 np=12
>>> [root at fe server_priv]#
>>>
>>> it is exact as your comment.
>>>
>>> My script:
>>>
>>> #!/bin/bash
>>>
>>> cd $PBS_O_WORKDIR
>>>
>>> mpirun -np 8 /usr/local/vasp/vasp
>>>
>>> This launches 8 vasp processes on one node. If I start one more job
>>> (with -np 8), it will run on the same node (n13).
>>> Likewise, if I start another job with -np 8
>>> (or -np 4), it will also run on the same node, n13.
>>>
>>> I configured JOBNODEMATCHPOLICY EXACTNODE in maui.cfg,
>>> but unfortunately the jobs still ran on node n13.
>>> This is an example of the output of top:
>>>
>>> top - 00:05:53 up 14 days,  6:47,  1 user,  load average: 4.18, 4.06, 4.09
>>> Mem:  15955108k total, 13287888k used,  2667220k free,   142168k buffers
>>> Swap: 67111528k total,    16672k used, 67094856k free, 11360332k cached
>>>
>>>     PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>> 21796 patricia  25   0  463m 291m  12m R 100.5  1.9 517:29.59 vasp
>>> 21797 patricia  25   0  448m 276m  11m R 100.2  1.8 518:51.49 vasp
>>> 21798 patricia  25   0  458m 287m  11m R 100.2  1.8 522:01.79 vasp
>>> 21799 patricia  25   0  448m 276m  11m R 99.9  1.8 519:04.25 vasp
>>>       1 root      15   0 10348  672  568 S  0.0  0.0   0:00.53 init
>>>       2 root      RT  -5     0    0    0 S  0.0  0.0   0:00.06 migration/0
>>>       3 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0
>>>       4 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0
>>>       5 root      RT  -5     0    0    0 S  0.0  0.0   0:00.04 migration/1
>>>
>>> The job that generates those 4 vasp processes is:
>>>
>>> #!/bin/bash
>>>
>>> cd $PBS_O_WORKDIR
>>>
>>> mpirun -np 4 /usr/local/vasp/vasp
>>>
>>> Thanks
>>>
>>> ----------------------------------------------------
>>> Ing. Fernando Caba
>>> Director General de Telecomunicaciones
>>> Universidad Nacional del Sur
>>> http://www.dgt.uns.edu.ar
>>> Tel/Fax: (54)-291-4595166
>>> Tel: (54)-291-4595101 int. 2050
>>> Avda. Alem 1253, (B8000CPB) Bahía Blanca - Argentina
>>> ----------------------------------------------------
>>>
>>>
>>> On 27/09/2011 08:07 PM, Gus Correa wrote:
>>>> Hi Fernando
>>>>
>>>> Did you try something like this in your
>>>> ${TORQUE}/server_priv/nodes file?
>>>>
>>>> frontend np=12 [skip this line if the frontend is not to do job work]
>>>> node1 np=12
>>>> node2 np=12
>>>> node3 np=12
>>>> node4 np=12
>>>>
>>>> This is probably the first thing to do.
>>>> It is not Maui, just plain Torque [actually pbs_server configuration].
>>>>
>>>> The lines above assume your nodes are called node1, ...
>>>> and the head node is called frontend,
>>>> in some name-resolvable manner [most likely
>>>> in your /etc/hosts file, pointing to the nodes'
>>>> IP addresses in your cluster's private subnet: 192.168.X.X,
>>>> 10.X.X.X, or equivalent].
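>>>>
>>>> [For example, /etc/hosts entries along these lines; the addresses
>>>> below are only placeholders for your own private subnet:
>>>>
>>>> 192.168.1.1    frontend
>>>> 192.168.1.10   node1
>>>> 192.168.1.11   node2
>>>> 192.168.1.12   node3
>>>> 192.168.1.13   node4 ]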
>>>>
>>>> The 'np=12' clause will allow at most 12 *processes* per node.
>>>>
>>>>
>>>> [However, if VASP is *threaded*, say via OpenMP, then it won't
>>>> prevent several threads from being launched by each process.
>>>> To handle threaded codes you can use some tricks, such as requesting
>>>> more cores than processes.
>>>> Sorry, I am not familiar enough with VASP to say more than this.]
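>>>>
>>>> [As a rough sketch of that trick, assuming a hybrid MPI/OpenMP build
>>>> that honors OMP_NUM_THREADS; the 2 x 6 split below is a placeholder:
>>>>
>>>> #PBS -l nodes=1:ppn=12
>>>> cd $PBS_O_WORKDIR
>>>> export OMP_NUM_THREADS=6            # 6 threads per MPI process
>>>> mpirun -np 2 /usr/local/vasp/vasp   # 2 processes x 6 threads = 12 cores ]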
>>>>
>>>> I would suggest that you take a look at the Torque Admin Manual
>>>> for more details:
>>>> http://www.adaptivecomputing.com/resources/docs/torque/
>>>>
>>>> There are further controls in Maui, such as
>>>> 'JOBNODEMATCHPOLICY EXACTNODE' in maui.cfg,
>>>> if you want full nodes allocated to each job,
>>>> as opposed to jobs sharing cores on a single node.
>>>> However, these choices may come later.
>>>> [You can change maui.cfg and restart the maui scheduler to
>>>> test various changes.]
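>>>>
>>>> [A minimal example of that change, assuming a default Maui install
>>>> with the usual init script:
>>>>
>>>> # in maui.cfg [often /usr/local/maui/maui.cfg]
>>>> JOBNODEMATCHPOLICY EXACTNODE
>>>>
>>>> # then restart the scheduler so the change takes effect
>>>> service maui restart ]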
>>>>
>>>> For Maui details see the Maui Admin Guide:
>>>> http://www.adaptivecomputing.com/resources/docs/maui/index.php
>>>>
>>>> I hope this helps,
>>>> Gus Correa
>>>>
>>>> Fernando Caba wrote:
>>>>> Hi everybody, I am using torque 3.0.1 and maui 3.3.1 in a configuration
>>>>> composed of a front end and 4 nodes (2 processors, 6 cores each),
>>>>> totaling 48 cores.
>>>>> I need to configure each node so that no more than 12 processes run
>>>>> on it (in particular we are using vasp), so we want no more than 12
>>>>> vasp processes per node.
>>>>> How can I configure this? I'm quite confused after reading a lot of
>>>>> information about torque and maui configuration.
>>>>>
>>>>> Thanks in advance.
>>>>>


