[torqueusers] torque does not kill jobs when wall_time or cpu_time reached

Glen Beane glen.beane at gmail.com
Tue Jun 8 15:29:04 MDT 2010


On Tue, Jun 8, 2010 at 10:36 AM, Ken Nielson
<knielson at adaptivecomputing.com> wrote:
> On 06/07/2010 05:34 PM, David Singleton wrote:
>> On 06/08/2010 01:50 AM, Ken Nielson wrote:
>>
>>> On 06/07/2010 09:29 AM, Glen Beane wrote:
>>>
>>>> On Mon, Jun 7, 2010 at 11:21 AM, Ken Nielson
>>>> <knielson at adaptivecomputing.com>    wrote:
>>>>
>>>>
>>>>> On 06/07/2010 09:10 AM, Glen Beane wrote:
>>>>>
>>>>>
>>>>>> On Mon, Jun 7, 2010 at 11:02 AM, Ken Nielson
>>>>>> <knielson at adaptivecomputing.com>      wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On 06/04/2010 08:14 PM, Glen Beane wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On Fri, Jun 4, 2010 at 5:37 PM, David Singleton
>>>>>>>> <David.Singleton at anu.edu.au>        wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> If procs is going to mean processors/cpus then I would suggest there needs
>>>>>>>>> to be a lot of code added to align nodes and procs - they are specifying
>>>>>>>>> the same thing.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> Moab treats them the same if you do not specify ppn with your nodes
>>>>>>>> request, however TORQUE is pretty much unaware of what -l procs=X
>>>>>>>> means - it just passes the info along to Moab. I would like to see
>>>>>>>> procs become a real torque resource that means give me X total
>>>>>>>> processors on anywhere from 1 to X nodes.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> Currently Moab interprets procs to mean give me all the processors on X
>>>>>>> nodes.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> that doesn't seem correct.  I use procs all the time and I do not get
>>>>>> this behavior from Moab (I've tried it with 5.3 and 5.4).  The
>>>>>> behavior I expect and see is for Moab to give me X total processors
>>>>>> spread across any number of nodes (the processors could all be on the
>>>>>> same node, or they could be spread across many nodes depending on what
>>>>>> is free at the time the job is scheduled to run).
>>>>>>
>>>>>>
>>>>>>
>>>>> Glen
>>>>>
>>>>> Try doing a qsub -l procs=1 <job.sh>. Then do a qstat -f and see what
>>>>> the exec_host is set to.
>>>>>
>>>>> I am running Moab 5.4.
>>>>>
>>>>>
>>>>>
>>>> you must have some TORQUE defaults set, like ncpus that are
>>>> interfering with procs.  Since -l procs does not set ncpus, your
>>>> default is getting applied.
>>>>
>>>> gbeane at wulfgar:~>    echo "sleep 60" | qsub -l procs=1,walltime=00:01:00
>>>> 69760.wulfgar.jax.org
>>>> qstat -f 69760
>>>> ...
>>>> exec_host = cs-short-2/0
>>>> ...
>>>>
>>>>
>>> Glen,
>>>
>>> You are right. I set those defaults while working on my last set of
>>> syntax problems. Ironically, they did not affect those resources.
>>>
>>> Ken
>>>
>>
>> I rest my case.
>>
>>
>> We treat ncpus as Moab appears to treat procs.  But the server also
>> aligns ncpus and nodes requests, e.g.:
>>
>> vayu2:~>  qsub -lncpus=4 -h
>> w
>> 194363.vu-pbs
>> vayu2:~>  qstat -f 194363
>> Job Id: 194363.vu-pbs
>>       ...
>>       Resource_List.ncpus = 4
>>       Resource_List.neednodes = 4:ppn=1
>>       Resource_List.nodect = 4
>>       Resource_List.nodes = 4:ppn=1
>>       ...
>>
>> vayu2:~>  qsub -lnodes=1:ppn=4 -h
>> w
>> 194365.vu-pbs
>> vayu2:~>  qstat -f 194365
>> Job Id: 194365.vu-pbs
>>       ...
>>       Resource_List.ncpus = 4
>>       Resource_List.neednodes = 1:ppn=4
>>       Resource_List.nodect = 1
>>       Resource_List.nodes = 1:ppn=4
>>       ...
>>
>> Any resource limits or defaults really apply to both ncpus (procs) and
>> nodes.
>>
>> David
>>
>>
> David,
>
> Thanks for your output. I am trying to sort this out without breaking
> anyone. (not likely though)
>
> Ken
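
Ken, for anyone else who trips over the same thing: a stray default is
usually the culprit.  A rough sketch of how to check for and clear one
(the queue name "batch" and the ncpus default here are just examples,
not necessarily what you actually have set):

  # show the full server/queue config and look for defaults
  qmgr -c 'print server' | grep resources_default

  # if something like this shows up, a plain -l procs=X job will
  # silently pick it up as well:
  #   set queue batch resources_default.ncpus = 1

  # clear it (only if you really don't want the default):
  qmgr -c 'unset queue batch resources_default.ncpus'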


In my opinion, JOBNODEMATCHPOLICY EXACTNODE should now be the default
behavior since we have -l procs.  If I ask for 5 nodes and 8 processors
per node, then that is what I should get; I don't want 10 nodes with 4
processors each, or 2 nodes with 16 processors and 1 with 8, etc.  If
people don't care about the layout of their job, they can use -l procs.
Hopefully, with -l select, things will be less ambiguous and will allow
for greater flexibility (let users be as precise as they want, but also
give them some way to say "I don't care, just give me X processors").
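
To make the difference concrete, here is a rough sketch (the walltimes
and script name are made up, and JOBNODEMATCHPOLICY goes in moab.cfg):

  # moab.cfg: honor the exact node/ppn layout the user asked for
  JOBNODEMATCHPOLICY  EXACTNODE

  # exact layout: 5 nodes with 8 processors on each (40 total)
  qsub -l nodes=5:ppn=8,walltime=01:00:00 job.sh

  # layout doesn't matter: 40 processors on anywhere from 1 to 40 nodes
  qsub -l procs=40,walltime=01:00:00 job.sh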

Also, the documentation should make clear that when you request a number
of processors per node (ppn) or a total number of processors (procs),
you are requesting virtual processors as configured in pbs_server.
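
Something like this is what I mean by virtual processors: the np values
in the server's nodes file, not whatever core count the OS on the node
happens to report (hostnames below are invented for the example):

  # $TORQUE_HOME/server_priv/nodes
  node01 np=8
  node02 np=8

  # ppn (and procs) are counted against those np slots
  echo "sleep 60" | qsub -l nodes=1:ppn=8,walltime=00:01:00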

