[torquedev] nodes, procs, tpn and ncpus

Ken Nielson knielson at adaptivecomputing.com
Wed Jun 9 09:52:14 MDT 2010


On 06/09/2010 09:09 AM, "Mgr. Šimon Tóth" wrote:
> On 9.6.2010 16:52, Ken Nielson wrote:
>    
>> On 06/09/2010 08:43 AM, "Mgr. Šimon Tóth" wrote:
>>      
>>> On 9.6.2010 16:40, Ken Nielson wrote:
>>>
>>>> On 06/09/2010 07:45 AM, Glen Beane wrote:
>>>>
>>>>>> I am going to modify TORQUE so it will process these resources more
>>>>>> like we expect.
>>>>>>
>>>>>>>    procs=x will mean give me x processors anywhere.
>>>>>>>
>>>>> great
>>>>>
>>>>>>>    nodes=x will mean the same as procs=x.
>>>>>>>
>>>>> I don't think this should be the case... Moab reinterprets it to mean
>>>>> the same thing, but historically with PBS that is not how it has been
>>>>> interpreted.
>>>>>
>>>>>>>    nodes=x:ppn=x will work as it currently does except that the
>>>>>>> value for nodes will not be ignored.
>>>>>>>
>>>>> What do you mean the value for nodes will not be ignored??? The value
>>>>> for nodes is NOT ignored now.
>>>>>
>>>>> gbeane@wulfgar:~> echo "sleep 60" | qsub -l nodes=2:ppn=4,walltime=00:01:00
>>>>> 69792.wulfgar.jax.org
>>>>> gbeane@wulfgar:~> qrun 69792
>>>>> gbeane@wulfgar:~> qstat -f 69792
>>>>> ...
>>>>>        exec_host = cs-prod-2/3+cs-prod-2/2+cs-prod-2/1+cs-prod-2/0+cs-prod-1/3+cs-prod-1/2+cs-prod-1/1+cs-prod-1/0
>>>>> ...
>>>>>        Resource_List.neednodes = 2:ppn=4
>>>>>        Resource_List.nodect = 2
>>>>>        Resource_List.nodes = 2:ppn=4
>>>>>
>>>> It seems you and Simon agree about how TORQUE is working. Following is
>>>> what I have in qmgr.
>>>>
>>>> #
>>>> # Create queues and set their attributes.
>>>> #
>>>> #
>>>> # Create and define queue batch
>>>> #
>>>> create queue batch
>>>> set queue batch queue_type = Execution
>>>> set queue batch resources_default.nodes = 1
>>>> set queue batch resources_default.walltime = 01:00:00
>>>> set queue batch enabled = True
>>>> set queue batch started = True
>>>> #
>>>> # Set server attributes.
>>>> #
>>>> set server scheduling = True
>>>> set server acl_host_enable = True
>>>> set server acl_hosts = l18
>>>> set server acl_hosts += L18
>>>> set server acl_hosts += kmn
>>>> set server managers = ken@kmn
>>>> set server operators = ken@kmn
>>>> set server default_queue = batch
>>>> set server log_events = 511
>>>> set server mail_from = adm
>>>> set server resources_available.nodect = 1024
>>>> set server scheduler_iteration = 600
>>>> set server node_check_rate = 150
>>>> set server tcp_timeout = 6
>>>> set server log_level = 6
>>>> set server mom_job_sync = True
>>>> set server keep_completed = 30
>>>> set server next_job_number = 100
>>>>
>>>> Whenever I do -l nodes=x:ppn=y where x is greater than 1, I still only
>>>> get one node allocated to the job.
>>>>
>>> Well, what scheduler are you using? Schedulers can completely mask the
>>> original nodespec. They can send their own nodespec in the run request.
>>>
>> I am not using any scheduler; I run my jobs by hand. A scheduler would
>> supersede any TORQUE interpretation.
>>      
> Well, then it's just weird. Can you post the server log?
>
Simon,

Attached are two server logs.

nodes2ppn2 comes from the command "qsub -l nodes=2:ppn=2 psaux". The
resulting job id for this is 101.kmn.

nodes3 comes from the command "qsub -l nodes=3 psaux".  The job id to 
look for in this is 102.kmn.
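
For reference, I ran both jobs by hand with no scheduler involved. The
sequence was roughly the following (prompts omitted; the exact qrun/qstat
invocations may have differed slightly):

qsub -l nodes=2:ppn=2 psaux    # returned 101.kmn
qrun 101
qstat -f 101
qsub -l nodes=3 psaux          # returned 102.kmn
qrun 102
qstat -f 102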

Following is the qstat output for nodes2ppn2:
Job Id: 101.kmn
     Job_Name = psaux
     Job_Owner = ken@kmn
     resources_used.cput = 00:00:00
     resources_used.mem = 0kb
     resources_used.vmem = 0kb
     resources_used.walltime = 00:00:00
     job_state = C
     queue = batch
     server = kmn
     Checkpoint = u
     ctime = Wed Jun  9 09:24:12 2010
     Error_Path = kmn:/home/ken/psaux.e101
     exec_host = kmn/1+kmn/0
     Hold_Types = n
     Join_Path = n
     Keep_Files = n
     Mail_Points = a
     mtime = Wed Jun  9 09:24:16 2010
     Output_Path = kmn:/home/ken/psaux.o101
     Priority = 0
     qtime = Wed Jun  9 09:24:12 2010
     Rerunable = True
     Resource_List.neednodes = 2:ppn=2
     Resource_List.nodect = 2
     Resource_List.nodes = 2:ppn=2
     Resource_List.walltime = 00:01:00
     session_id = 2800
     substate = 59
     Variable_List = PBS_O_HOME=/home/ken,PBS_O_LANG=en_US.utf8,
     PBS_O_LOGNAME=ken,
     PBS_O_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/b
     in:/usr/games:/usr/local/lib:usr/local/bin:/usr/local/lib/:/opt/slicke
     dit/bin:/opt/moab/sbin/:/opt/moab/bin/,PBS_O_SHELL=/bin/bash,
     PBS_O_HOST=kmn,PBS_SERVER=kmn,PBS_O_WORKDIR=/home/ken,
     PBS_O_QUEUE=batch
     euser = ken
     egroup = ken
     hashname = 101.kmn
     queue_rank = 1
     queue_type = E
     etime = Wed Jun  9 09:24:12 2010
     exit_status = 0
     submit_args = -l nodes=2:ppn=2 psaux
     start_time = Wed Jun  9 09:24:16 2010
     Walltime.Remaining = 5
     start_count = 1
     fault_tolerant = False
     comp_time = Wed Jun  9 09:24:16 2010

Following is the qstat output for nodes3:
Job Id: 102.kmn
     Job_Name = psaux
     Job_Owner = ken@kmn
     resources_used.cput = 00:00:00
     resources_used.mem = 0kb
     resources_used.vmem = 0kb
     resources_used.walltime = 00:00:00
     job_state = C
     queue = batch
     server = kmn
     Checkpoint = u
     ctime = Wed Jun  9 09:45:11 2010
     Error_Path = kmn:/home/ken/psaux.e102
     exec_host = kmn/0
     Hold_Types = n
     Join_Path = n
     Keep_Files = n
     Mail_Points = a
     mtime = Wed Jun  9 09:45:16 2010
     Output_Path = kmn:/home/ken/psaux.o102
     Priority = 0
     qtime = Wed Jun  9 09:45:11 2010
     Rerunable = True
     Resource_List.neednodes = 3
     Resource_List.nodect = 3
     Resource_List.nodes = 3
     Resource_List.walltime = 00:01:00
     session_id = 3000
     substate = 59
     Variable_List = PBS_O_HOME=/home/ken,PBS_O_LANG=en_US.utf8,
     PBS_O_LOGNAME=ken,
     PBS_O_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/b
     in:/usr/games:/usr/local/lib:usr/local/bin:/usr/local/lib/:/opt/slicke
     dit/bin:/opt/moab/sbin/:/opt/moab/bin/,PBS_O_SHELL=/bin/bash,
     PBS_O_HOST=kmn,PBS_SERVER=kmn,PBS_O_WORKDIR=/home/ken,
     PBS_O_QUEUE=batch
     euser = ken
     egroup = ken
     hashname = 102.kmn
     queue_rank = 1
     queue_type = E
     etime = Wed Jun  9 09:45:11 2010
     exit_status = 0
     submit_args = -l nodes=3 psaux
     start_time = Wed Jun  9 09:45:16 2010
     Walltime.Remaining = 5
     start_count = 1
     fault_tolerant = False
     comp_time = Wed Jun  9 09:45:16 2010
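
To make the relevant fields easier to compare, something like this (a rough
sketch, not the exact commands I ran) pulls them out of the qstat output:

qstat -f 101.kmn | grep -E 'exec_host|Resource_List'
qstat -f 102.kmn | grep -E 'exec_host|Resource_List'

In both cases exec_host only lists slots on kmn, even though
Resource_List.nodect is 2 and 3 respectively.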

Thanks for taking a look.

Ken




-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: nodes2ppn2
Url: http://www.supercluster.org/pipermail/torquedev/attachments/20100609/56879db9/attachment-0002.pl 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: nodes3
Url: http://www.supercluster.org/pipermail/torquedev/attachments/20100609/56879db9/attachment-0003.pl 

