[torqueusers] torque does not kill jobs when wall_time or cpu_time reached

Ken Nielson knielson at adaptivecomputing.com
Tue Jun 8 16:30:13 MDT 2010


On 06/08/2010 03:29 PM, Glen Beane wrote:
> On Tue, Jun 8, 2010 at 10:36 AM, Ken Nielson
> <knielson at adaptivecomputing.com>  wrote:
>    
>> On 06/07/2010 05:34 PM, David Singleton wrote:
>>      
>>> On 06/08/2010 01:50 AM, Ken Nielson wrote:
>>>
>>>        
>>>> On 06/07/2010 09:29 AM, Glen Beane wrote:
>>>>
>>>>          
>>>>> On Mon, Jun 7, 2010 at 11:21 AM, Ken Nielson
>>>>> <knielson at adaptivecomputing.com>      wrote:
>>>>>
>>>>>
>>>>>            
>>>>>> On 06/07/2010 09:10 AM, Glen Beane wrote:
>>>>>>
>>>>>>
>>>>>>              
>>>>>>> On Mon, Jun 7, 2010 at 11:02 AM, Ken Nielson
>>>>>>> <knielson at adaptivecomputing.com>        wrote:
>>>>>>>
>>>>>>>> On 06/04/2010 08:14 PM, Glen Beane wrote:
>>>>>>>>
>>>>>>>>> On Fri, Jun 4, 2010 at 5:37 PM, David Singleton
>>>>>>>>> <David.Singleton at anu.edu.au>          wrote:
>>>>>>>>>
>>>>>>>>>> If procs is going to mean processors/cpus then I would suggest there needs
>>>>>>>>>> to be a lot of code added to align nodes and procs - they are specifying
>>>>>>>>>> the same thing.
>>>>>>>>>>
>>>>>>>>> Moab treats them the same if you do not specify ppn with your nodes
>>>>>>>>> request; however, TORQUE is pretty much unaware of what -l procs=X
>>>>>>>>> means - it just passes the info along to Moab. I would like to see
>>>>>>>>> procs become a real TORQUE resource that means "give me X total
>>>>>>>>> processors on anywhere from 1 to X nodes."
>>>>>>>>>
>>>>>>>> Currently Moab interprets procs to mean give me all the processors on X
>>>>>>>> nodes.
>>>>>>>>
>>>>>>> that doesn't seem correct.  I use procs all the time and I do not get
>>>>>>> this behavior from Moab (I've tried it with 5.3 and 5.4).  The
>>>>>>> behavior I expect and see is for Moab to give me X total processors
>>>>>>> spread across any number of nodes (the processors could all be on the
>>>>>>> same node, or they could be spread across many nodes depending on what
>>>>>>> is free at the time the job is scheduled to run).
>>>>>>>
>>>>>> Glen
>>>>>>
>>>>>> Try doing a qsub -l procs=1 <job.sh>. Then do a qstat -f and see
>>>>>> what exec_host is set to.
>>>>>>
>>>>>> I am running Moab 5.4.
>>>>>>
>>>>> you must have some TORQUE defaults set, like ncpus that are
>>>>> interfering with procs.  Since -l procs does not set ncpus, your
>>>>> default is getting applied.
>>>>>
>>>>> gbeane at wulfgar:~>      echo "sleep 60" | qsub -l procs=1,walltime=00:01:00
>>>>> 69760.wulfgar.jax.org
>>>>> qstat -f 69760
>>>>> ...
>>>>> exec_host = cs-short-2/0
>>>>> ...
>>>>>
>>>>>
>>>>>            
>>>> Glen,
>>>>
>>>> You are right. I set those on my last set of problems with syntax.
>>>> Ironically they did not affect those resources.
>>>>
>>>> Ken
>>>>
>>>>          
>>> I rest my case.
>>>
>>>
>>> We treat ncpus as Moab appears to treat procs.  But the server also
>>> aligns ncpus and nodes requests, e.g.:
>>>
>>> vayu2:~>    qsub -lncpus=4 -h w
>>> 194363.vu-pbs
>>> vayu2:~>    qstat -f 194363
>>> Job Id: 194363.vu-pbs
>>>        ...
>>>        Resource_List.ncpus = 4
>>>        Resource_List.neednodes = 4:ppn=1
>>>        Resource_List.nodect = 4
>>>        Resource_List.nodes = 4:ppn=1
>>>        ...
>>>
>>> vayu2:~>    qsub -lnodes=1:ppn=4 -h w
>>> 194365.vu-pbs
>>> vayu2:~>    qstat -f 194365
>>> Job Id: 194365.vu-pbs
>>>        ...
>>>        Resource_List.ncpus = 4
>>>        Resource_List.neednodes = 1:ppn=4
>>>        Resource_List.nodect = 1
>>>        Resource_List.nodes = 1:ppn=4
>>>        ...
>>>
>>> Any resource limits or defaults really apply to both ncpus (procs) and
>>> nodes.
>>>
>>> David
>>>
>>>
>>>        
>> David,
>>
>> Thanks for your output. I am trying to sort this out without breaking
>> anyone. (not likely though)
>>
>> Ken
>>      
>
> In my opinion, JOBNODEMATCHPOLICY EXACTNODE should now be the default
> behavior since we have -l procs.  If I ask for 5 nodes and 8 processors
> per node, then that is what I should get: I don't want 10 nodes with 4
> processors each, or 2 nodes with 16 processors and 1 with 8, etc.  If
> people don't care about the layout of their job they can use -l procs.
> Hopefully with select things will be less ambiguous and will allow for
> greater flexibility (let the user be as precise as they want, but also
> allow some way to say "I don't care, just give me X processors").
>
> Also, the documentation should be clear that when you request a number
> of processors per node (ppn) or a number of processors (procs), it is
> talking about virtual processors as configured in pbs_server.
>
Glen,

So if I ask for nodes=5:ppn=8, I should get 5 separate machines with 8
processors each.
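
For example, with Moab's JOBNODEMATCHPOLICY set to EXACTNODE (just a
sketch; job.sh is a placeholder script):

    # moab.cfg
    JOBNODEMATCHPOLICY EXACTNODE

    # exactly 5 hosts with 8 processors on each (40 total)
    qsub -l nodes=5:ppn=8,walltime=01:00:00 job.sh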

Also, -l procs=x should simply be a request for x processors, placed
anywhere.
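
Something like this, again just a sketch with job.sh as a placeholder:

    # 40 processors total, spread across anywhere from 1 to 40 hosts
    qsub -l procs=40,walltime=01:00:00 job.sh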

Ken
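
P.S. For anyone else who runs into the default-resource problem Glen
pointed out, you can check for and clear a stray ncpus default with
qmgr (a sketch, run as the TORQUE administrator):

    qmgr -c 'print server' | grep resources_default
    qmgr -c 'unset server resources_default.ncpus'

And the "virtual processors" Glen mentions are the np= values in the
server nodes file, e.g.:

    # $TORQUE_HOME/server_priv/nodes
    node01 np=8
    node02 np=8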


