[torqueusers] torque does not kill jobs when wall_time or cpu_time reached
Ken Nielson
knielson at adaptivecomputing.com
Tue Jun 8 16:30:13 MDT 2010
On 06/08/2010 03:29 PM, Glen Beane wrote:
> On Tue, Jun 8, 2010 at 10:36 AM, Ken Nielson
> <knielson at adaptivecomputing.com> wrote:
>
>> On 06/07/2010 05:34 PM, David Singleton wrote:
>>
>>> On 06/08/2010 01:50 AM, Ken Nielson wrote:
>>>
>>>
>>>> On 06/07/2010 09:29 AM, Glen Beane wrote:
>>>>
>>>>
>>>>> On Mon, Jun 7, 2010 at 11:21 AM, Ken Nielson
>>>>> <knielson at adaptivecomputing.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> On 06/07/2010 09:10 AM, Glen Beane wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Mon, Jun 7, 2010 at 11:02 AM, Ken Nielson
>>>>>>> <knielson at adaptivecomputing.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On 06/04/2010 08:14 PM, Glen Beane wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Fri, Jun 4, 2010 at 5:37 PM, David Singleton
>>>>>>>>> <David.Singleton at anu.edu.au> wrote:
>>>>>>>>>
>>>>>>>>>> If procs is going to mean processors/cpus then I would suggest there needs
>>>>>>>>>> to be a lot of code added to align nodes and procs - they are specifying
>>>>>>>>>> the same thing.
>>>>>>>>>>
>>>>>>>>> Moab treats them the same if you do not specify ppn with your nodes
>>>>>>>>> request, however TORQUE is pretty much unaware of what -l procs=X
>>>>>>>>> means - it just passes the info along to Moab. I would like to see
>>>>>>>>> procs become a real torque resource that means give me X total
>>>>>>>>> processors on anywhere from 1 to X nodes.
>>>>>>>>>
>>>>>>>> Currently Moab interprets procs to mean give me all the processors on X
>>>>>>>> nodes.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> that doesn't seem correct. I use procs all the time and I do not get
>>>>>>> this behavior from Moab (I've tried it with 5.3 and 5.4). The
>>>>>>> behavior I expect and see is for Moab to give me X total processors
>>>>>>> spread across any number of nodes (the processors could all be on the
>>>>>>> same node, or they could be spread across many nodes depending on what
>>>>>>> is free at the time the job is scheduled to run).
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> Glen
>>>>>>
>>>>>> Try doing a qsub -l procs=1 <job.sh>. Then do a qstat -f and see what
>>>>>> the exec_host is set to.
>>>>>>
>>>>>> I am running Moab 5.4.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> you must have some TORQUE defaults set, like ncpus that are
>>>>> interfering with procs. Since -l procs does not set ncpus, your
>>>>> default is getting applied.
>>>>>
>>>>> gbeane at wulfgar:~> echo "sleep 60" | qsub -l procs=1,walltime=00:01:00
>>>>> 69760.wulfgar.jax.org
>>>>> qstat -f 69760
>>>>> ...
>>>>> exec_host = cs-short-2/0
>>>>> ...
>>>>>
>>>>>
>>>>>
>>>> Glen,
>>>>
>>>> You are right. I set those on my last set of problems with syntax.
>>>> Ironically they did not affect those resources.
>>>>
>>>> Ken
>>>>
>>>>
>>> I rest my case.
>>>
>>>
>>> We treat ncpus as moab appears to treat procs. But the server also
>>> aligns ncpus and nodes requests, eg.
>>>
>>> vayu2:~> qsub -lncpus=4 -h
>>> w
>>> 194363.vu-pbs
>>> vayu2:~> qstat -f 194363
>>> Job Id: 194363.vu-pbs
>>> ...
>>> Resource_List.ncpus = 4
>>> Resource_List.neednodes = 4:ppn=1
>>> Resource_List.nodect = 4
>>> Resource_List.nodes = 4:ppn=1
>>> ...
>>>
>>> vayu2:~> qsub -lnodes=1:ppn=4 -h
>>> w
>>> 194365.vu-pbs
>>> vayu2:~> qstat -f 194365
>>> Job Id: 194365.vu-pbs
>>> ...
>>> Resource_List.ncpus = 4
>>> Resource_List.neednodes = 1:ppn=4
>>> Resource_List.nodect = 1
>>> Resource_List.nodes = 1:ppn=4
>>> ...
>>>
>>> Any resource limits or defaults really apply to both ncpus (procs) and
>>> nodes.
>>>
>>> David
>>>
>>>
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>
>>>
>> David,
>>
>> Thanks for your output. I am trying to sort this out without breaking
>> anyone. (not likely though)
>>
>> Ken
>>
>
> in my opinion JOBNODEMATCHPOLICY EXACTNODE should now be the default
> behavior since we have -l procs. If I ask for 5 nodes and 8 processors
> per node then that is what I should get. I don't want 10 nodes with 4
> processors or 2 nodes with 16 processors and 1 with 8, etc. If people
> don't care about the layout of their job they can use -l procs.
> Hopefully, with select, things will be less ambiguous and will allow for
> greater flexibility (let the user be precise as they want, but also
> allow some way to say I don't care, just give me X processors).
>
> Also, the documentation should be clear that when you request a number
> of processors per node (ppn) or a number of processors (procs) it is
> talking about virtual processors as configured in pbs_server.
>
>
Glen,
So if I ask for nodes=5:ppn=8 I should get 5 separate machines with 8
processors each.
Also, -l procs=x should only be a request for x processors anywhere.
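To make sure we mean the same thing, here is a rough Python sketch of the two behaviors as I read them. It is only a toy model, not TORQUE or Moab code: the function names, the table of free processors, and the simple first-fit ordering are all assumptions made up for illustration.

```python
from typing import Dict, Optional


def place_exact(free: Dict[str, int], nodes: int, ppn: int) -> Optional[Dict[str, int]]:
    """Model of nodes=N:ppn=P read strictly: N distinct hosts, P processors on each."""
    # Keep only hosts that can supply all P processors, then take the first N.
    chosen = [host for host in sorted(free) if free[host] >= ppn][:nodes]
    if len(chosen) < nodes:
        return None  # not enough hosts with P free processors
    return {host: ppn for host in chosen}


def place_procs(free: Dict[str, int], procs: int) -> Optional[Dict[str, int]]:
    """Model of procs=X: X processors in total, taken from any hosts with room."""
    placement: Dict[str, int] = {}
    remaining = procs
    for host in sorted(free):
        if remaining == 0:
            break
        take = min(free[host], remaining)
        if take > 0:
            placement[host] = take
            remaining -= take
    # Succeed only if the whole request could be satisfied.
    return placement if remaining == 0 else None


if __name__ == "__main__":
    # Hypothetical cluster state: host name -> free (virtual) processors.
    free = {"n1": 8, "n2": 8, "n3": 4, "n4": 16}
    print(place_exact(free, nodes=2, ppn=8))
    print(place_procs(free, procs=10))
```

With the made-up table above, nodes=2:ppn=8 lands on n1 and n2 with 8 processors each, while procs=10 takes 8 from n1 and 2 from n2. A request for nodes=5:ppn=8 fails in this model because only three hosts have 8 free processors.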
Ken