[torqueusers] torque does not kill jobs when wall_time or cpu_time reached
David.Singleton at anu.edu.au
Tue Jun 8 16:37:04 MDT 2010
On 06/09/2010 07:29 AM, Glen Beane wrote:
> in my opinion JOBNODEMATCHPOLICY EXACTNODE should now be the default
> behavior since we have -l procs. If I ask for 5 nodes and 8 processors
> per node then that is what I should get. I don't want 10 nodes with 4
> processors or 2 nodes with 16 processors and 1 with 8, etc. If people
> don't care about the layout of their job they can use -l procs.
> hopefully with select things will be less ambiguous and will allow for
> greater flexibility (let the user be precise as they want, but also
> allow some way to say I don't care, just give me X processors).
Our experience is that very few users want detailed control over exactly
how many physical nodes they get - it seems to be only comp sci students
or similar with mistaken ideas about the value of such control. They
don't seem to realise that when they demand 1 cpu from each of 16 nodes,
variability in what is running on the other cpus on those nodes will make
a mockery of any performance numbers they deduce. Other reasons for
requesting exact nodes are usually to do with another resource (memory,
network interfaces, GPUs, ...). It should be requests for those resources/
node properties that get what the user wants, not the number of nodes.
We certainly have more users with hybrid MPI-OpenMP codes and for them,
nodes are really "virtual nodes", e.g. a request for -lnodes=8:ppn=4 means
the job will be running with 8 MPI tasks each of which will have 4 threads -
the job needs any (the best?) set of cpus that can run that. A 32P SMP
might be a perfectly acceptable solution.
I suspect hybrid codes will become more common.
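
To make the "virtual nodes" reading concrete, here is a rough sketch of
what I mean. It is not how Torque or any scheduler actually does placement;
the function name place_virtual_nodes, the host names and the greedy
"biggest host first" rule are all made up for illustration. The only rule
it encodes is that each virtual node needs ppn cpus on a single host.

```python
def place_virtual_nodes(nodes, ppn, free_cpus):
    """Greedy placement of `nodes` virtual nodes, each needing `ppn` cpus
    on one host. `free_cpus` maps a (hypothetical) host name to its free
    cpu count. Returns {host: virtual nodes placed there}, or None if the
    request cannot be satisfied."""
    placement = {}
    remaining = nodes
    # Assumption: try hosts with the most free cpus first.
    for host, free in sorted(free_cpus.items(), key=lambda kv: -kv[1]):
        if remaining == 0:
            break
        fit = min(free // ppn, remaining)  # whole virtual nodes that fit here
        if fit > 0:
            placement[host] = fit
            remaining -= fit
    return placement if remaining == 0 else None


# -lnodes=8:ppn=4 on a single 32P SMP: all 8 virtual nodes land on one host.
single_smp = place_virtual_nodes(8, 4, {"smp": 32})

# The same request spread over four 8-cpu hosts: 2 virtual nodes on each.
four_hosts = place_virtual_nodes(8, 4, {"a": 8, "b": 8, "c": 8, "d": 8})
```

Either layout can run 8 MPI tasks with 4 threads each, which is the point:
the request describes the shape of the job, not a count of physical boxes.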
So I would suggest EXACTNODE should not be the default but rather that
users thinking they want such detailed control should have to specify some
other option to show this (e.g. -lother=exactnodes), i.e. nodes are
"virtual nodes" unless the user specifies otherwise.
> Also, the documentation should be clear that when you request a number
> of processors per node (ppn) or a number of processors (procs) it is
> talking about virtual processors as configured in pbs_server
Note that virtual processors != physical processors causes a number of
problems. Certainly cpuset-aware MOMs are going to barf with such a setup
and the problem is that they don't know this is the config, only the server
and scheduler do. It sorta makes sense for the number of virtual processors
to be set in the MOM's config file so it can shut down NUMA/cpuset/binding
code when it doesn't make sense.
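
Roughly the kind of check I have in mind, as a sketch only. This is not
pbs_mom code and the MOM has no such setting today; the function name
should_enable_binding and the configured_np parameter are hypothetical.
It assumes the MOM is told the virtual processor count and compares it
with what the OS reports.

```python
import os


def should_enable_binding(configured_np, physical_cpus=None):
    """Return True if cpuset/NUMA binding makes sense on this host.

    configured_np: virtual processor count the MOM was told about
                   (hypothetical MOM config value).
    physical_cpus: cpus actually present; defaults to what the OS reports.
    """
    if physical_cpus is None:
        physical_cpus = os.cpu_count() or 1
    # Assumption: binding is only safe when virtual == physical processors.
    return configured_np == physical_cpus


matching = should_enable_binding(8, physical_cpus=8)         # True
oversubscribed = should_enable_binding(16, physical_cpus=8)  # False
```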