[torqueusers] pbsnodes reports the same job running many times

Gus Correa gus at ldeo.columbia.edu
Thu Apr 19 16:47:35 MDT 2012


On 04/19/2012 06:25 PM, Leonardo Gregory Brunnet wrote:
> Hi Gus,
>
> Problem solved using simply "nodes=X".
> Thanks for all the suggestions!
>
> Leonardo
> P.S.  We never had Moab here... ncpus appeared probably
> from some foreign script ;) .
>

We also use Torque+Maui here.

I don't remember exactly, but ncpus may work under the barebones
Torque/PBS scheduler pbs_sched, besides Moab.
'ncpus' seems to be a bit troublesome with Maui, though.
The easy solution is to ask the users to stick to the
'nodes=X' syntax.
In a more elaborate solution you can write a qsub wrapper to
replace 'ncpus' with the 'nodes' and 'ppn' syntax.
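
A rough sketch of such a wrapper, not something we run in production
here. It assumes that 'ncpus=N' is meant as N cores on a single node,
so it maps it to 'nodes=1:ppn=N'. The names REAL_QSUB, rewrite_ncpus
and qsub_wrapper are made up for illustration, and the path to the
renamed original qsub is hypothetical. It only rewrites the value
given after a separate '-l' on the command line; '#PBS -l' directives
inside the job script and the attached form '-lncpus=N' are left alone.

```shell
#!/bin/sh
# Hypothetical location of the original qsub binary after renaming it.
REAL_QSUB="${REAL_QSUB:-/usr/local/bin/qsub.real}"

# Rewrite "ncpus=N" in one resource list to "nodes=1:ppn=N".
# Assumption: ncpus is meant as N cores on a single node.
rewrite_ncpus() {
    printf '%s\n' "$1" | sed -E 's/(^|,)ncpus=([0-9]+)/\1nodes=1:ppn=\2/'
}

# Walk the argument list once, rewriting only the value that follows "-l",
# then hand the rebuilt list to the real qsub.
qsub_wrapper() {
    n=$#
    prev=""
    while [ "$n" -gt 0 ]; do
        arg=$1
        shift
        if [ "$prev" = "-l" ]; then
            arg=$(rewrite_ncpus "$arg")
        fi
        # Append the (possibly rewritten) argument to the end of the list.
        set -- "$@" "$arg"
        prev=$arg
        n=$((n - 1))
    done
    "$REAL_QSUB" "$@"
}
```

To use it as the site-wide qsub, the script would end with a call to
qsub_wrapper "$@" and be installed ahead of the real qsub in the
users' PATH.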

Gus Correa

> On 19-04-2012 18:08, Gus Correa wrote:
>> Hi Leonardo
>>
>> On 04/19/2012 04:18 PM, Leonardo Gregory Brunnet wrote:
>>
>>> Hi Gus,
>>>
>>> Thanks for the answer.
>>>
>>> Yes, I am surprised that it is using four processors.
>>> As previously replied to David the argument used in the qsub script was
>>> ...
>>> #PBS -l ncpus=1
>>> ...
>>>
>> Somebody may correct me, but I think ncpus is a Moab thing,
>> which may or may not work right with Torque+Maui.
>> If you search this mailing list you will find other postings
>> about ncpus.
>>
>> Here we don't use ncpus.
>> We stick to the 'nodes=X:ppn=Y' syntax.
>> It works for us.
>>
>>
>>> and I suppose this is correct. But in fact I don't know the difference
>>> between this one
>>> above and
>>> #PBS -l nodes=1
>>>
>>> I have also checked that in  maui.cfg there is no specification for
>>>
>>> JOBNODEMATCHPOLICY
>>>
>>> but, in fact I don't know what the default is. If EXACTNODE is the default
>>> I should explicitly add a line to maui.cfg, correct?
>>>
>>>
>> Check JOBNODEMATCHPOLICY in the Maui Admin guide, although it
>> doesn't tell the default.
>>
>> http://www.adaptivecomputing.com/resources/docs/maui/a.fparameters.php
>>
>> You can add the line with your option for JOBNODEMATCHPOLICY
>> to maui.cfg and restart maui.
>> We use EXACTNODE here.
>>
>> Gus Correa
>>
>>
>>> Leonardo
>>>
>>> On 19-04-2012 12:44, Gus Correa wrote:
>>>
>>>> Hi Leonardo
>>>>
>>>> Not sure if I understood the problem right.
>>>> I guess the job is legitimate and running,
>>>> but it surprises you that it is using four processors,
>>>> right?
>>>>
>>>> Did the user request four processors, perhaps,
>>>> even though he/she is running a serial job?
>>>> #PBS -l nodes=1:ppn=4
>>>> This may be reasonable, say, if his/her job needs a lot
>>>> of RAM, but the job is serial
>>>> [or if it is Matlab ... the king of memory-greediness ...]
>>>>
>>>> Also, beware of JOBNODEMATCHPOLICY in Maui [maui.cfg]:
>>>> http://www.adaptivecomputing.com/resources/docs/maui/a.fparameters.php
>>>> If set to EXACTNODE full nodes will be allocated.
>>>>
>>>> I hope this helps,
>>>> Gus Correa
>>>>
>>>> On 04/18/2012 06:26 PM, Leonardo Gregory Brunnet wrote:
>>>>
>>>>
>>>>> Dear All,
>>>>>
>>>>> In a freshly installed torque/maui cluster the server reports
>>>>> repeated execution of a job on a given node. (There is no MPI job
>>>>> running!)
>>>>>
>>>>> The output for pbsnodes for one given node gives:
>>>>>
>>>>> node131
>>>>>            state = job-exclusive
>>>>>            np = 4
>>>>>            properties = quadcore
>>>>>            ntype = cluster
>>>>>            jobs = 0/78898.master.cluster.XX.XX.XX, 1/78898.master.cluster.XX.XX.XX, 2/78898.master.cluster.XX.XX.XX, 3/78898.master.XX.XX.XX
>>>>>            status = rectime=1334786811,varattr=,jobs=78898.master.cluster.if.ufrgs.br,state=free,netload=2914588064,gres=,loadave=1.00,ncpus=4,physmem=3985876kb,availmem=4649240kb,totmem=5062188kb,idletime=535832,nusers=2,nsessions=2,sessions=2804 8224,uname=Linux node131 2.6.23-1-amd64 #1 SMP Fri Oct 12 23:45:48 UTC 2007 x86_64,opsys=linux
>>>>>            gpus = 0
>>>>>
>>>>> But if we log in to that node we see what was expected, a single job.
>>>>> Since the torque server (or maui) "believes" all CPUs of that node are
>>>>> working, no other jobs are sent. Any clues?
>>>>>
>>>>> Thanks for the help!
>>>>>
>>>>> Leonardo
>>>>>
>>>>> Below, you find the output for
>>>>> # qmgr -c "p s"
>>>>>
>>>>> #
>>>>> # Create queues and set their attributes.
>>>>> #
>>>>> #
>>>>> # Create and define queue padrao
>>>>> #
>>>>> create queue padrao
>>>>> set queue padrao queue_type = Execution
>>>>> set queue padrao resources_default.nodes = 7
>>>>> set queue padrao resources_default.walltime = 01:00:00
>>>>> set queue padrao max_user_run = 5
>>>>> set queue padrao enabled = True
>>>>> set queue padrao started = True
>>>>> #
>>>>> # Create and define queue um_mes
>>>>> #
>>>>> create queue um_mes
>>>>> set queue um_mes queue_type = Execution
>>>>> set queue um_mes resources_max.nodes = 7
>>>>> set queue um_mes resources_default.nodes = 7
>>>>> set queue um_mes resources_default.walltime = 720:00:00
>>>>> set queue um_mes max_user_run = 5
>>>>> set queue um_mes enabled = True
>>>>> set queue um_mes started = True
>>>>> #
>>>>> # Create and define queue batch
>>>>> #
>>>>> create queue batch
>>>>> set queue batch queue_type = Execution
>>>>> set queue batch resources_default.nodes = 1
>>>>> set queue batch resources_default.walltime = 01:00:00
>>>>> set queue batch enabled = True
>>>>> set queue batch started = True
>>>>> #
>>>>> # Create and define queue um_dia
>>>>> #
>>>>> create queue um_dia
>>>>> set queue um_dia queue_type = Execution
>>>>> set queue um_dia resources_max.nodes = 7
>>>>> set queue um_dia resources_default.nodes = 7
>>>>> set queue um_dia resources_default.walltime = 24:00:00
>>>>> set queue um_dia max_user_run = 7
>>>>> set queue um_dia enabled = True
>>>>> set queue um_dia started = True
>>>>> #
>>>>> # Create and define queue uma_semana
>>>>> #
>>>>> create queue uma_semana
>>>>> set queue uma_semana queue_type = Execution
>>>>> set queue uma_semana resources_max.nodes = 7
>>>>> set queue uma_semana resources_default.nodes = 7
>>>>> set queue uma_semana resources_default.walltime = 168:00:00
>>>>> set queue uma_semana max_user_run = 5
>>>>> set queue uma_semana enabled = True
>>>>> set queue uma_semana started = True
>>>>> #
>>>>> # Create and define queue route
>>>>> #
>>>>> create queue route
>>>>> set queue route queue_type = Route
>>>>> set queue route route_destinations = padrao
>>>>> set queue route route_destinations += padrao2
>>>>> set queue route enabled = True
>>>>> set queue route started = True
>>>>> #
>>>>> # Set server attributes.
>>>>> #
>>>>> set server scheduling = True
>>>>> set server acl_hosts = master.cluster.XX.XX.XX
>>>>> set server acl_hosts += clusterapg
>>>>> set server managers = root@master.cluster.XX.XX.XX
>>>>> set server operators = root@master.cluster.XX.XX.XX
>>>>> set server default_queue = padrao
>>>>> set server log_events = 511
>>>>> set server mail_from = adm
>>>>> set server scheduler_iteration = 600
>>>>> set server node_check_rate = 150
>>>>> set server tcp_timeout = 6
>>>>> set server mom_job_sync = True
>>>>> set server keep_completed = 300
>>>>> set server next_job_number = 79033
>>>>>
>>>>>
>>>>>
>>>> _______________________________________________
>>>> torqueusers mailing list
>>>> torqueusers at supercluster.org
>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>
>>>>
>>>>
>>>
>>
>>
>


