[torqueusers] torque does not kill jobs when wall_time or cpu_time reached

Glen Beane glen.beane at gmail.com
Tue Jun 8 17:54:39 MDT 2010





On Jun 8, 2010, at 6:37 PM, David Singleton  
<David.Singleton at anu.edu.au> wrote:

> On 06/09/2010 07:29 AM, Glen Beane wrote:
>>
>> In my opinion JOBNODEMATCHPOLICY EXACTNODE should now be the default
>> behavior, since we have -l procs. If I ask for 5 nodes and 8
>> processors per node, then that is what I should get. I don't want 10
>> nodes with 4 processors, or 2 nodes with 16 processors and 1 with 8,
>> etc. If people don't care about the layout of their job they can use
>> -l procs. Hopefully with select things will be less ambiguous and
>> will allow for greater flexibility (let the user be as precise as
>> they want, but also allow some way to say "I don't care, just give me
>> X processors").
>
> Our experience is that very few users want detailed control over
> exactly how many physical nodes they get - it seems to be only comp
> sci students or similar with mistaken ideas about the value of such
> control. They don't seem to realise that when they demand 1 cpu from
> each of 16 nodes, variability in what is running on the other cpus on
> those nodes will make a mockery of any performance numbers they
> deduce. Other reasons for requesting exact nodes are usually to do
> with another resource (memory, network interfaces, GPUs, ...). It
> should be requests for those resources/node properties that get what
> the user wants, not the number of nodes.
>
> We certainly have more users with hybrid MPI-OpenMP codes and for
> them, nodes are really "virtual nodes", e.g. a request for
> -lnodes=8:ppn=4 means the job will be running with 8 MPI tasks each
> of which will have 4 threads - the job needs any (the best?) set of
> cpus that can run that. A 32P SMP might be a perfectly acceptable
> solution.
>

Select takes care of this. You request 8 tasks with 4 virtual procs per
task. The scheduler can co-locate tasks. However, if I go to the
trouble of requesting a specific number of nodes then I should get them.
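
To make the two readings of nodes=8:ppn=4 concrete, here is a toy
sketch in Python. It is not TORQUE or Moab code; the function names,
host names and free-cpu counts are all made up for illustration. It
only contrasts "N distinct hosts with ppn cpus each" against "N tasks
of ppn cpus each, co-located wherever they fit".

```python
def place_exact(free, nodes, ppn):
    """Exact-node reading: pick `nodes` distinct hosts, each with at
    least `ppn` free cpus. Returns {host: cpus} or None if impossible."""
    hosts = [h for h, n in sorted(free.items()) if n >= ppn]
    if len(hosts) < nodes:
        return None
    return {h: ppn for h in hosts[:nodes]}


def place_virtual(free, tasks, cpus_per_task):
    """Virtual-node reading: place `tasks` groups of `cpus_per_task`
    cpus; several groups may share one host, but a single group is
    never split across hosts. Returns {host: cpus} or None."""
    placement = {}
    remaining = tasks
    for host, n in sorted(free.items()):
        if remaining == 0:
            break
        fit = min(remaining, n // cpus_per_task)
        if fit:
            placement[host] = fit * cpus_per_task
            remaining -= fit
    return placement if remaining == 0 else None


# Hypothetical cluster state: one 32-cpu SMP box with everything free.
free = {"smp": 32}
print(place_virtual(free, 8, 4))  # all 8 tasks fit on the one host
print(place_exact(free, 8, 4))    # 8 distinct hosts are not available
```

With a single 32-cpu host the virtual-node reading succeeds and the
exact-node reading fails, which is the case being argued about above.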


Replying from my phone so ignore the rest of this email. It is a pain  
to delete what I'm not commenting on.


> I suspect hybrid codes will become more common.
>
> So I would suggest EXACTNODE should not be the default but rather that
> users thinking they want such detailed control should have to specify
> some other option to show this (e.g. -lother=exactnodes), i.e. nodes
> are "virtual nodes" unless the user specifies otherwise.
>
>>
>> Also, the documentation should be clear that when you request a
>> number of processors per node (ppn) or a number of processors (procs)
>> it is talking about virtual processors as configured in pbs_server.
>
> True.
>
> Note that virtual processors != physical processors causes a number of
> problems. Certainly cpuset-aware MOMs are going to barf with such a
> setup, and the problem is that they don't know this is the config;
> only the server and scheduler do. It sorta makes sense for the number
> of virtual processors to be set in the MOM's config file so it can
> shut down NUMA/cpuset/binding code when it doesn't make sense.
>
> Cheers,
> David

