[torquedev] [Bug 93] Resource management semantics of Torque need to be well defined

bugzilla-daemon at supercluster.org
Thu Oct 28 15:29:37 MDT 2010


http://www.clusterresources.com/bugzilla/show_bug.cgi?id=93

--- Comment #7 from David Singleton <David.Singleton at anu.edu.au> 2010-10-28 15:29:36 MDT ---
(In reply to comment #6)
> (In reply to comment #5)
> > "Processes per node" is often how it is explained, although you are right
> > that nothing in it actually limits the number of processes that can be run.
> > It may originally have been intended to mean processors per node, but now
> > almost all processors intended for computing have multiple cores, which makes
> > "processors per node" completely ambiguous and therefore not very useful.
> > 
> > However, it is in the code in a few ways:
> > 
> > ppn is the number of times that the node name will appear in $PBS_NODEFILE.
> > This file is intended to be read by the MPI scripts in the job, which then
> > start that many processes. Nothing in TORQUE stops the scripts from spawning
> > more processes, though.
> > 
> > ppn is left completely configurable per node, so the notion that it is tied
> > to the actual hardware is false. In production systems ppn often becomes cores
> > per node, because that is the number the system admin wants for optimal use.
> > 
> > The fact of the matter is that ppn hasn't been clearly defined over time, and
> > what it has become in practice is probably best described as processes per
> > node. At any rate, changing this behavior would greatly disrupt life for *very*
> > many TORQUE users.
> 
> As Chris Samuel pointed out, the "p" in "ppn" meant "virtual processors".  A
> "virtual processor" can mean a core - for most of us that is exactly what it
> means.  It can mean an "execution slot" for those sites that set node np
> greater than the number of physical cores (or hyperthread contexts).  The
> important thing is that it is a characteristic of the hardware/system/site.  It
> is not a property of the job.  The number of processes in a job is a property
> of a job.  In general there is no alignment. 
> 
> If I were to run a 16-thread OpenMP job, what value of ppn would I use? The
> OpenMP app will have 1 process, but there will also be 2 shells in the job, so
> it's likely to be 3 processes. So ppn=3? What I actually want is 16 bits of
> hardware that each can run a thread without conflict (as much as possible),
> i.e. I want 16 virtual processors.  
> 
> Yes, the use of the term "processor" needs to be spelt out as above. But at
> least it can be made technically accurate. The use of the term "process" cannot
> unless you want to turn it into a property of the system.

I'm not sure what change Simon wanted, but just to be clear, this looks like a
purely documentation issue to me. The only thing that has changed since the
"good ol' PBS days" is that someone started documenting "virtual processors" as
"processes", which is very confusing. As far as I am concerned, the behaviour is
OK; it is just the terminology that is totally wrong. Simon will have to explain
what he sees as the problem.
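To make the nodefile behaviour quoted above concrete, here is a minimal Python sketch (not part of TORQUE) that counts how often each host appears in $PBS_NODEFILE. It assumes one host name per line; the function name slots_per_node and the sample host names node01/node02 are hypothetical. The count it reports is an allocation of virtual processors (slots), not a process limit.

```python
import os
from collections import Counter


def slots_per_node(nodefile_text: str) -> Counter:
    # Assumption: one host name per line. A host granted N virtual
    # processors (e.g. via ppn=N) appears N times. Blank lines are skipped.
    hosts = [line.strip() for line in nodefile_text.splitlines() if line.strip()]
    return Counter(hosts)


if __name__ == "__main__":
    path = os.environ.get("PBS_NODEFILE")
    if path:
        # Inside a job: read the nodefile the batch system wrote.
        with open(path) as f:
            text = f.read()
    else:
        # Outside a job: hypothetical contents for a nodes=2:ppn=4 request.
        text = "node01\n" * 4 + "node02\n" * 4
    for host, slots in slots_per_node(text).items():
        # These are slots (virtual processors). Nothing here, or in the
        # batch system as described above, stops a job starting more processes.
        print(f"{host}: {slots} slots")
```

For the 16-thread OpenMP case in comment #6, the host would be listed 16 times even though only one process (plus the job shells) actually runs, which is why "virtual processors" describes ppn better than "processes".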

Note: I am not a Torque user, merely someone who would not like to see
confusion amongst users when using variants of PBS.
