[torqueusers] torque does not kill jobs when wall_time or cpu_time reached

Ken Nielson knielson at adaptivecomputing.com
Tue Jun 8 08:36:33 MDT 2010


On 06/07/2010 05:34 PM, David Singleton wrote:
> On 06/08/2010 01:50 AM, Ken Nielson wrote:
>    
>> On 06/07/2010 09:29 AM, Glen Beane wrote:
>>      
>>> On Mon, Jun 7, 2010 at 11:21 AM, Ken Nielson
>>> <knielson at adaptivecomputing.com>    wrote:
>>>
>>>        
>>>> On 06/07/2010 09:10 AM, Glen Beane wrote:
>>>>
>>>>          
>>>>> On Mon, Jun 7, 2010 at 11:02 AM, Ken Nielson
>>>>> <knielson at adaptivecomputing.com>      wrote:
>>>>>
>>>>>> On 06/04/2010 08:14 PM, Glen Beane wrote:
>>>>>>
>>>>>>> On Fri, Jun 4, 2010 at 5:37 PM, David Singleton
>>>>>>> <David.Singleton at anu.edu.au>        wrote:
>>>>>>>
>>>>>>>> If procs is going to mean processors/cpus then I would suggest there needs
>>>>>>>> to be a lot of code added to align nodes and procs - they are specifying
>>>>>>>> the same thing.
>>>>>>>>
>>>>>>> Moab treats them the same if you do not specify ppn with your nodes
>>>>>>> request; however, TORQUE is pretty much unaware of what -l procs=X
>>>>>>> means - it just passes the info along to Moab. I would like to see
>>>>>>> procs become a real TORQUE resource that means give me X total
>>>>>>> processors on anywhere from 1 to X nodes.
>>>>>>>
>>>>>> Currently Moab interprets procs to mean give me all the processors on X
>>>>>> nodes.
>>>>>>
>>>>> that doesn't seem correct.  I use procs all the time and I do not get
>>>>> this behavior from Moab (I've tried it with 5.3 and 5.4).  The
>>>>> behavior I expect and see is for Moab to give me X total processors
>>>>> spread across any number of nodes (the processors could all be on the
>>>>> same node, or they could be spread across many nodes depending on what
>>>>> is free at the time the job is scheduled to run).
>>>>>
>>>> Glen
>>>>
>>>> Try doing a qsub -l procs=1 <job.sh>. Then do a qstat -f and see what
>>>> the exec_host is set to.
>>>>
>>>> I am running Moab 5.4.
>>>>
>>> you must have some TORQUE defaults set, like ncpus that are
>>> interfering with procs.  Since -l procs does not set ncpus, your
>>> default is getting applied.
>>>
>>> gbeane at wulfgar:~>    echo "sleep 60" | qsub -l procs=1,walltime=00:01:00
>>> 69760.wulfgar.jax.org
>>> qstat -f 69760
>>> ...
>>> exec_host = cs-short-2/0
>>> ...
>>>
>>>        
>> Glen,
>>
>> You are right. I set those while working through my last set of syntax
>> problems. Ironically, they did not affect those resources.
>>
>> Ken
>>      
>
> I rest my case.
>
>
> We treat ncpus as Moab appears to treat procs.  But the server also
> aligns ncpus and nodes requests, e.g.:
>
> vayu2:~>  qsub -lncpus=4 -h
> w
> 194363.vu-pbs
> vayu2:~>  qstat -f 194363
> Job Id: 194363.vu-pbs
>       ...
>       Resource_List.ncpus = 4
>       Resource_List.neednodes = 4:ppn=1
>       Resource_List.nodect = 4
>       Resource_List.nodes = 4:ppn=1
>       ...
>
> vayu2:~>  qsub -lnodes=1:ppn=4 -h
> w
> 194365.vu-pbs
> vayu2:~>  qstat -f 194365
> Job Id: 194365.vu-pbs
>       ...
>       Resource_List.ncpus = 4
>       Resource_List.neednodes = 1:ppn=4
>       Resource_List.nodect = 1
>       Resource_List.nodes = 1:ppn=4
>       ...
>
> Any resource limits or defaults really apply to both ncpus (procs) and
> nodes.
>
> David
>
>
David,

Thanks for your output. I am trying to sort this out without breaking 
anyone (not likely, though).
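
For anyone who wants to check whether a server or queue default is
overriding a procs request (as it was on my test system), this is the sort
of thing to look for. The queue name "batch" here is just an example;
substitute your own:

   qmgr -c "print server" | grep resources_default
   qmgr -c "list queue batch"

   # clear an interfering default, if one is set
   qmgr -c "unset server resources_default.ncpus"
   qmgr -c "unset queue batch resources_default.ncpus"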

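And to see the difference Glen and David describe between the two request
styles, a quick sketch (host names made up):

   # X total processors, spread across whatever nodes have free slots
   echo "sleep 60" | qsub -l procs=4,walltime=00:01:00
   qstat -f <jobid> | grep exec_host   # e.g. node01/0+node01/1+node02/0+node02/1

   # one node with 4 processors on it
   echo "sleep 60" | qsub -l nodes=1:ppn=4,walltime=00:01:00
   qstat -f <jobid> | grep exec_host   # e.g. node01/0+node01/1+node01/2+node01/3
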
Ken

