[torquedev] nodes, procs, tpn and ncpus

"Mgr. Šimon Tóth" SimonT at mail.muni.cz
Wed Jun 9 10:03:50 MDT 2010


On 9.6.2010 17:52, Ken Nielson wrote:
> On 06/09/2010 09:09 AM, "Mgr. Šimon Tóth" wrote:
>> On 9.6.2010 16:52, Ken Nielson wrote:
>>   
>>> On 06/09/2010 08:43 AM, "Mgr. Šimon Tóth" wrote:
>>>     
>>>> On 9.6.2010 16:40, Ken Nielson wrote:
>>>>
>>>>> On 06/09/2010 07:45 AM, Glen Beane wrote:
>>>>>
>>>>>>> I am going to modify TORQUE so it will process these resources more
>>>>>>> like we expect.
>>>>>>>
>>>>>>>>    procs=x will mean give me x processors anywhere.
>>>>>>
>>>>>> great
>>>>>>
>>>>>>>>    nodes=x will mean the same as procs=x.
>>>>>>
>>>>>> I don't think this should be the case... Moab reinterprets it to mean
>>>>>> the same thing, but historically with PBS that is not how it has been
>>>>>> interpreted.
>>>>>>
>>>>>>>>    nodes=x:ppn=x will work as it currently does except that the
>>>>>>>> value for nodes will not be ignored.
>>>>>>
>>>>>> What do you mean, the value for nodes will not be ignored??? The value
>>>>>> for nodes is NOT ignored now.
>>>>>>
>>>>>> gbeane at wulfgar:~>    echo "sleep 60" | qsub -l
>>>>>> nodes=2:ppn=4,walltime=00:01:00
>>>>>> 69792.wulfgar.jax.org
>>>>>> gbeane at wulfgar:~>    qrun 69792
>>>>>> gbeane at wulfgar:~>    qstat -f 69792
>>>>>> ...
>>>>>>        exec_host =
>>>>>> cs-prod-2/3+cs-prod-2/2+cs-prod-2/1+cs-prod-2/0+cs-prod-1/3+cs
>>>>>>      -prod-1/2+cs-prod-1/1+cs-prod-1/0
>>>>>> ...
>>>>>>        Resource_List.neednodes = 2:ppn=4
>>>>>>        Resource_List.nodect = 2
>>>>>>        Resource_List.nodes = 2:ppn=4
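
(To pull together the three request forms under discussion, a sketch with
made-up counts, walltime and script name:

   qsub -l procs=8,walltime=00:01:00 job.sh         # 8 processors, placed anywhere
   qsub -l nodes=8,walltime=00:01:00 job.sh         # proposed to mean the same as procs=8 (disputed above)
   qsub -l nodes=2:ppn=4,walltime=00:01:00 job.sh   # 2 nodes with 4 processors each

As the exec_host above shows, the last form already hands out two hosts with
four slots each today.)
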
>>>>>
>>>>> It seems you and Simon agree about how TORQUE is working. Following is
>>>>> what I have in qmgr.
>>>>>
>>>>> #
>>>>> # Create queues and set their attributes.
>>>>> #
>>>>> #
>>>>> # Create and define queue batch
>>>>> #
>>>>> create queue batch
>>>>> set queue batch queue_type = Execution
>>>>> set queue batch resources_default.nodes = 1
>>>>> set queue batch resources_default.walltime = 01:00:00
>>>>> set queue batch enabled = True
>>>>> set queue batch started = True
>>>>> #
>>>>> # Set server attributes.
>>>>> #
>>>>> set server scheduling = True
>>>>> set server acl_host_enable = True
>>>>> set server acl_hosts = l18
>>>>> set server acl_hosts += L18
>>>>> set server acl_hosts += kmn
>>>>> set server managers = ken at kmn
>>>>> set server operators = ken at kmn
>>>>> set server default_queue = batch
>>>>> set server log_events = 511
>>>>> set server mail_from = adm
>>>>> set server resources_available.nodect = 1024
>>>>> set server scheduler_iteration = 600
>>>>> set server node_check_rate = 150
>>>>> set server tcp_timeout = 6
>>>>> set server log_level = 6
>>>>> set server mom_job_sync = True
>>>>> set server keep_completed = 30
>>>>> set server next_job_number = 100
>>>>>
>>>>> Whenever I do -l nodes=x:ppn=y where x is greater than 1, I still only
>>>>> get one node allocated to the job.
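
One thing that may be worth checking on a test box is what pbs_server itself
has registered as nodes and slots. A sketch, with an illustrative spool path
and np value (both depend on how TORQUE was configured):

   pbsnodes -a                               # state and np of every node the server knows about
   cat /var/spool/torque/server_priv/nodes   # e.g. a single line like:  kmn np=4
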
>>>>>
>>>> Well, what scheduler are you using? Schedulers can completely mask the
>>>> original nodespec. They can send their own nodespec in the run request.
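
(For illustration, when a scheduler runs a job it can pass an explicit host
list in the run request, roughly the manual equivalent of

   qrun -H node01+node02 <jobid>

where the host names are made up and the exact -H host-list syntax varies
between TORQUE versions. A bare "qrun <jobid>" leaves the node allocation to
pbs_server and the original nodespec.)
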
>>>>
>>> I am not using any scheduler; I run my jobs by hand. A scheduler would
>>> supersede any TORQUE interpretation.
>>>      
>> Well, then it's just weird. Can you post the server log?
>>
> Simon,
> 
> Attached are two server logs.
> 
> node2ppn2 comes from the command "qsub -l nodes=2:ppn=2 psaux".  The
> resulting job id for this is 101.kmn.
> 
> nodes3 comes from the command "qsub -l nodes=3 psaux".  The job id to
> look for in this is 102.kmn.
> 
> Following is the qstat output for node2ppn2
> Job Id: 101.kmn
>     Job_Name = psaux
>     Job_Owner = ken at kmn
>     resources_used.cput = 00:00:00
>     resources_used.mem = 0kb
>     resources_used.vmem = 0kb
>     resources_used.walltime = 00:00:00
>     job_state = C
>     queue = batch
>     server = kmn
>     Checkpoint = u
>     ctime = Wed Jun  9 09:24:12 2010
>     Error_Path = kmn:/home/ken/psaux.e101
>     exec_host = kmn/1+kmn/0
>     Hold_Types = n
>     Join_Path = n
>     Keep_Files = n
>     Mail_Points = a
>     mtime = Wed Jun  9 09:24:16 2010
>     Output_Path = kmn:/home/ken/psaux.o101
>     Priority = 0
>     qtime = Wed Jun  9 09:24:12 2010
>     Rerunable = True
>     Resource_List.neednodes = 2:ppn=2
>     Resource_List.nodect = 2
>     Resource_List.nodes = 2:ppn=2
>     Resource_List.walltime = 00:01:00
>     session_id = 2800
>     substate = 59
>     Variable_List = PBS_O_HOME=/home/ken,PBS_O_LANG=en_US.utf8,
>     PBS_O_LOGNAME=ken,
>     PBS_O_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/b
>     in:/usr/games:/usr/local/lib:usr/local/bin:/usr/local/lib/:/opt/slicke
>     dit/bin:/opt/moab/sbin/:/opt/moab/bin/,PBS_O_SHELL=/bin/bash,
>     PBS_O_HOST=kmn,PBS_SERVER=kmn,PBS_O_WORKDIR=/home/ken,
>     PBS_O_QUEUE=batch
>     euser = ken
>     egroup = ken
>     hashname = 101.kmn
>     queue_rank = 1
>     queue_type = E
>     etime = Wed Jun  9 09:24:12 2010
>     exit_status = 0
>     submit_args = -l nodes=2:ppn=2 psaux
>     start_time = Wed Jun  9 09:24:16 2010
>     Walltime.Remaining = 5
>     start_count = 1
>     fault_tolerant = False
>     comp_time = Wed Jun  9 09:24:16 2010
> 
> Following is the qstat output for nodes3
> Job Id: 102.kmn
>     Job_Name = psaux
>     Job_Owner = ken at kmn
>     resources_used.cput = 00:00:00
>     resources_used.mem = 0kb
>     resources_used.vmem = 0kb
>     resources_used.walltime = 00:00:00
>     job_state = C
>     queue = batch
>     server = kmn
>     Checkpoint = u
>     ctime = Wed Jun  9 09:45:11 2010
>     Error_Path = kmn:/home/ken/psaux.e102
>     exec_host = kmn/0
>     Hold_Types = n
>     Join_Path = n
>     Keep_Files = n
>     Mail_Points = a
>     mtime = Wed Jun  9 09:45:16 2010
>     Output_Path = kmn:/home/ken/psaux.o102
>     Priority = 0
>     qtime = Wed Jun  9 09:45:11 2010
>     Rerunable = True
>     Resource_List.neednodes = 3
>     Resource_List.nodect = 3
>     Resource_List.nodes = 3
>     Resource_List.walltime = 00:01:00
>     session_id = 3000
>     substate = 59
>     Variable_List = PBS_O_HOME=/home/ken,PBS_O_LANG=en_US.utf8,
>     PBS_O_LOGNAME=ken,
>     PBS_O_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/b
>     in:/usr/games:/usr/local/lib:usr/local/bin:/usr/local/lib/:/opt/slicke
>     dit/bin:/opt/moab/sbin/:/opt/moab/bin/,PBS_O_SHELL=/bin/bash,
>     PBS_O_HOST=kmn,PBS_SERVER=kmn,PBS_O_WORKDIR=/home/ken,
>     PBS_O_QUEUE=batch
>     euser = ken
>     egroup = ken
>     hashname = 102.kmn
>     queue_rank = 1
>     queue_type = E
>     etime = Wed Jun  9 09:45:11 2010
>     exit_status = 0
>     submit_args = -l nodes=3 psaux
>     start_time = Wed Jun  9 09:45:16 2010
>     Walltime.Remaining = 5
>     start_count = 1
>     fault_tolerant = False
>     comp_time = Wed Jun  9 09:45:16 2010
> 
> Thanks for taking a look.
> 
> Ken
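
So for 101.kmn the request nodes=2:ppn=2 ended up with exec_host =
kmn/1+kmn/0 (two slots on one host), and for 102.kmn the request nodes=3
ended up with just kmn/0. For a quick side-by-side view while keep_completed
still retains the jobs, something like this works:

   qstat -f 101.kmn 102.kmn | grep -E 'exec_host|Resource_List.nodes'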

This seems like some weird bug. I haven't synced with upstream for a
while, but if you check the listelem() function, the first thing it does
is check the number of nodes requested (stored in num). Then it loops
over every node until it finds enough matching nodes (stored in hit).
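
If you want to look at that code yourself, it lives in the server sources; a
quick way to find it in a source checkout (the exact file layout differs a
bit between TORQUE versions):

   grep -rn 'listelem' src/server/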

-- 
Mgr. Šimon Tóth
