[torquedev] nodes, procs, tpn and ncpus
Ken Nielson
knielson at adaptivecomputing.com
Wed Jun 9 09:52:14 MDT 2010
On 06/09/2010 09:09 AM, "Mgr. Šimon Tóth" wrote:
> On 9.6.2010 16:52, Ken Nielson wrote:
>
>> On 06/09/2010 08:43 AM, "Mgr. Šimon Tóth" wrote:
>>
>>> On 9.6.2010 16:40, Ken Nielson wrote:
>>>
>>>
>>>> On 06/09/2010 07:45 AM, Glen Beane wrote:
>>>>
>>>>
>>>>>> I am going to modify TORQUE so it will process these resources more
>>>>>> like we expect.
>>>>>>
>>>>>>
>>>>>>> procs=x will mean give me x processors anywhere.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>> great
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>> nodes=x will mean the same as procs=x.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>> I don't think this should be the case... Moab reinterprets it to mean
>>>>> the same thing, but historically with PBS that is not how it has been
>>>>> interpreted.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>> nodes=x:ppn=x will work as it currently does except that the
>>>>>>> value for nodes will not be ignored.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>> what do you mean the value for nodes will not be ignored??? The value
>>>>> for nodes is NOT ignored now.
>>>>>
>>>>>
>>>>> gbeane at wulfgar:~> echo "sleep 60" | qsub -l nodes=2:ppn=4,walltime=00:01:00
>>>>> 69792.wulfgar.jax.org
>>>>> gbeane at wulfgar:~> qrun 69792
>>>>> gbeane at wulfgar:~> qstat -f 69792
>>>>> ...
>>>>> exec_host = cs-prod-2/3+cs-prod-2/2+cs-prod-2/1+cs-prod-2/0+cs-prod-1/3+cs-prod-1/2+cs-prod-1/1+cs-prod-1/0
>>>>> ...
>>>>> Resource_List.neednodes = 2:ppn=4
>>>>> Resource_List.nodect = 2
>>>>> Resource_List.nodes = 2:ppn=4
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>> It seems you and Simon agree about how TORQUE is working. Following is
>>>> what I have in qmgr.
>>>>
>>>> #
>>>> # Create queues and set their attributes.
>>>> #
>>>> #
>>>> # Create and define queue batch
>>>> #
>>>> create queue batch
>>>> set queue batch queue_type = Execution
>>>> set queue batch resources_default.nodes = 1
>>>> set queue batch resources_default.walltime = 01:00:00
>>>> set queue batch enabled = True
>>>> set queue batch started = True
>>>> #
>>>> # Set server attributes.
>>>> #
>>>> set server scheduling = True
>>>> set server acl_host_enable = True
>>>> set server acl_hosts = l18
>>>> set server acl_hosts += L18
>>>> set server acl_hosts += kmn
>>>> set server managers = ken at kmn
>>>> set server operators = ken at kmn
>>>> set server default_queue = batch
>>>> set server log_events = 511
>>>> set server mail_from = adm
>>>> set server resources_available.nodect = 1024
>>>> set server scheduler_iteration = 600
>>>> set server node_check_rate = 150
>>>> set server tcp_timeout = 6
>>>> set server log_level = 6
>>>> set server mom_job_sync = True
>>>> set server keep_completed = 30
>>>> set server next_job_number = 100
>>>>
>>>> Whenever I do -l nodes=x:ppn=y where x is greater than 1 I still only
>>>> get one node allocated to the job.
>>>>
>>>>
>>> Well, what scheduler are you using? Schedulers can completely mask the
>>> original nodespec. They can send their own nodespec in the run request.
>>>
>>>
>>>
>> I am not using any scheduler. I run my jobs by hand. The scheduler will
>> supersede any TORQUE interpretation.
>>
> Well, then it's just weird. Can you post the server log?
>
>
Simon,
Attached are two server logs.
nodes2ppn2 comes from the command "qsub -l nodes=2:ppn=2 psaux". The
resulting job id for this is 101.kmn.
nodes3 comes from the command "qsub -l nodes=3 psaux". The job id to
look for in this is 102.kmn.
Following is the qstat output for nodes2ppn2
Job Id: 101.kmn
Job_Name = psaux
Job_Owner = ken at kmn
resources_used.cput = 00:00:00
resources_used.mem = 0kb
resources_used.vmem = 0kb
resources_used.walltime = 00:00:00
job_state = C
queue = batch
server = kmn
Checkpoint = u
ctime = Wed Jun 9 09:24:12 2010
Error_Path = kmn:/home/ken/psaux.e101
exec_host = kmn/1+kmn/0
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Wed Jun 9 09:24:16 2010
Output_Path = kmn:/home/ken/psaux.o101
Priority = 0
qtime = Wed Jun 9 09:24:12 2010
Rerunable = True
Resource_List.neednodes = 2:ppn=2
Resource_List.nodect = 2
Resource_List.nodes = 2:ppn=2
Resource_List.walltime = 00:01:00
session_id = 2800
substate = 59
Variable_List = PBS_O_HOME=/home/ken,PBS_O_LANG=en_US.utf8,
PBS_O_LOGNAME=ken,
PBS_O_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/b
in:/usr/games:/usr/local/lib:usr/local/bin:/usr/local/lib/:/opt/slicke
dit/bin:/opt/moab/sbin/:/opt/moab/bin/,PBS_O_SHELL=/bin/bash,
PBS_O_HOST=kmn,PBS_SERVER=kmn,PBS_O_WORKDIR=/home/ken,
PBS_O_QUEUE=batch
euser = ken
egroup = ken
hashname = 101.kmn
queue_rank = 1
queue_type = E
etime = Wed Jun 9 09:24:12 2010
exit_status = 0
submit_args = -l nodes=2:ppn=2 psaux
start_time = Wed Jun 9 09:24:16 2010
Walltime.Remaining = 5
start_count = 1
fault_tolerant = False
comp_time = Wed Jun 9 09:24:16 2010
Following is the qstat output for nodes3
Job Id: 102.kmn
Job_Name = psaux
Job_Owner = ken at kmn
resources_used.cput = 00:00:00
resources_used.mem = 0kb
resources_used.vmem = 0kb
resources_used.walltime = 00:00:00
job_state = C
queue = batch
server = kmn
Checkpoint = u
ctime = Wed Jun 9 09:45:11 2010
Error_Path = kmn:/home/ken/psaux.e102
exec_host = kmn/0
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Wed Jun 9 09:45:16 2010
Output_Path = kmn:/home/ken/psaux.o102
Priority = 0
qtime = Wed Jun 9 09:45:11 2010
Rerunable = True
Resource_List.neednodes = 3
Resource_List.nodect = 3
Resource_List.nodes = 3
Resource_List.walltime = 00:01:00
session_id = 3000
substate = 59
Variable_List = PBS_O_HOME=/home/ken,PBS_O_LANG=en_US.utf8,
PBS_O_LOGNAME=ken,
PBS_O_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/b
in:/usr/games:/usr/local/lib:usr/local/bin:/usr/local/lib/:/opt/slicke
dit/bin:/opt/moab/sbin/:/opt/moab/bin/,PBS_O_SHELL=/bin/bash,
PBS_O_HOST=kmn,PBS_SERVER=kmn,PBS_O_WORKDIR=/home/ken,
PBS_O_QUEUE=batch
euser = ken
egroup = ken
hashname = 102.kmn
queue_rank = 1
queue_type = E
etime = Wed Jun 9 09:45:11 2010
exit_status = 0
submit_args = -l nodes=3 psaux
start_time = Wed Jun 9 09:45:16 2010
Walltime.Remaining = 5
start_count = 1
fault_tolerant = False
comp_time = Wed Jun 9 09:45:16 2010
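For anyone reading along, the mismatch in the two listings above can be checked mechanically. The following is only a rough Python sketch, not anything that ships with TORQUE: the helper names (parse_nodespec, parse_exec_host, compare) are made up for illustration, and it assumes the nodespec has the simple numeric form "N" or "N:ppn=M" (no host names or properties) and that exec_host is a "+"-separated list of host/slot entries as shown by qstat -f.

```python
from collections import Counter


def parse_nodespec(spec):
    """Parse a simple 'N' or 'N:ppn=M' nodespec into (nodes, ppn).

    Assumption: only the numeric form is handled; a spec that names
    hosts or properties will raise ValueError.
    """
    parts = spec.split(":")
    nodes = int(parts[0])
    ppn = 1  # ppn defaults to 1 when it is not given
    for part in parts[1:]:
        if part.startswith("ppn="):
            ppn = int(part[len("ppn="):])
    return nodes, ppn


def parse_exec_host(exec_host):
    """Count slots per host in a string such as 'kmn/1+kmn/0'."""
    return Counter(entry.split("/")[0] for entry in exec_host.split("+"))


def compare(spec, exec_host):
    """Return requested versus allocated node and slot counts."""
    nodes, ppn = parse_nodespec(spec)
    slots = parse_exec_host(exec_host)
    return {
        "requested_nodes": nodes,
        "requested_slots": nodes * ppn,
        "allocated_nodes": len(slots),
        "allocated_slots": sum(slots.values()),
    }


if __name__ == "__main__":
    # Values copied from the qstat output for jobs 101.kmn and 102.kmn.
    print(compare("2:ppn=2", "kmn/1+kmn/0"))
    print(compare("3", "kmn/0"))
```

With the values from the listings, job 101.kmn asked for 2 nodes (4 slots) and got 2 slots on the single host kmn, and job 102.kmn asked for 3 nodes and got 1 slot on kmn.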
Thanks for taking a look.
Ken
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: nodes2ppn2
Url: http://www.supercluster.org/pipermail/torquedev/attachments/20100609/56879db9/attachment-0002.pl
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: nodes3
Url: http://www.supercluster.org/pipermail/torquedev/attachments/20100609/56879db9/attachment-0003.pl