[torquedev] [Bug 93] Resource management semantics of Torque need to be well defined

bugzilla-daemon at supercluster.org bugzilla-daemon at supercluster.org
Thu Oct 28 17:29:33 MDT 2010


http://www.clusterresources.com/bugzilla/show_bug.cgi?id=93

--- Comment #9 from Simon Toth <SimonT at mail.muni.cz> 2010-10-28 17:29:33 MDT ---
(In reply to comment #7)
> (In reply to comment #6)
> > (In reply to comment #5)
> > > Processes per node is often how it is explained, although you are right, it
> > > isn't restricted in any way to actually limit the number of processes that can
> > > be run. It may have originally been intended to be processors per node, but now
> > > almost all processors intended for computing have multiple cores, making
> > > processors per node completely ambiguous and therefore not very useful.
> > > 
> > > However, it is in the code in a few ways:
> > > 
> > > ppn is the number of times that nodename will appear in the $PBS_NODEFILE. This
> > > is intended to be read by the mpi scripts on the program to then make that many
> > > processes. There is nothing in TORQUE that stops the scripts from spawning more
> > > processes though.
> > > 
> > > ppn is left completely configurable per node, and so the notion that it is tied
> > > to the actual hardware is false. Often in production systems, ppn becomes cores
> > > per node, because that's how many the system admin wants for optimal use. 
> > > 
> > > The fact of the matter is that ppn hasn't been clearly defined over time, and
> > > what it has become in practice is probably best described as processes per
> > > node. At any rate, changing this behavior would greatly disrupt life for *very*
> > > many TORQUE users.
> > 
> > As Chris Samuel pointed out, the "p" in "ppn" meant "virtual processors".  A
> > "virtual processor" can mean a core - for most of us that is exactly what it
> > means.  It can mean an "execution slot" for those sites that set node np
> > greater than the number of physical cores (or hyperthread contexts).  The
> > important thing is that it is a characteristic of the hardware/system/site.  It
> > is not a property of the job.  The number of processes in a job is a property
> > of a job.  In general there is no alignment. 
> > 
> > If I was to run a 16 thread OpenMP job, what value of ppn do I use?  The OpenMP
> > app will have 1 process.  But then there will be 2 shells in the job so it's
> > likely to be 3 processes.  So ppn=3 ?  What I actually want is 16 bits of
> > hardware that each can run a thread without conflict (as much as possible),
> > i.e. I want 16 virtual processors.  
> > 
> > Yes, the use of the term "processor" needs to be spelt out as above. But at
> > least it can be made technically accurate. The use of the term "process" cannot
> > unless you want to turn it into a property of the system.
> 
> I'm not sure what change Simon wanted but, just to be clear, this looks like a
> purely documentation issue to me. The only thing that has changed since the
> "good ol' PBS days" is that someone started documenting "virtual processors" as
> "processes" which is very confusing.  As far as I am concerned the behaviour is
> OK, just the terminology is totally wrong.  Simon will have to explain what he
> sees as the problem.
> 
> Note: I am not a Torque user, merely someone who would not like to see
> confusion amongst users when using variants of PBS.

It would be great if this were just a documentation issue. The node, in
particular, interprets ppn as processes. In the server code it doesn't really
make any difference, but the server still creates a sub-node for each
process.

One problem with using ppn as cpus/cores is that when you request pmem, pvmem,
or any other per-process (p*) resource, you will get ppn*amount, which can be
counterintuitive.
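
To make the arithmetic concrete, here is a rough sketch in Python. The
function name and the example request are made up for illustration, and it
assumes that a per-process limit is simply multiplied by ppn as described
above; it is not taken from the Torque code.

```python
def per_node_total(ppn: int, per_process_mb: int) -> int:
    """Memory (in MB) effectively requested on one node.

    Assumption: a per-process limit such as pmem or pvmem is
    multiplied by ppn, as described in the text above.
    """
    if ppn < 1 or per_process_mb < 0:
        raise ValueError("ppn must be >= 1 and the limit non-negative")
    return ppn * per_process_mb


# Hypothetical request: -l nodes=1:ppn=16,pmem=2gb for a 16-thread OpenMP job.
# The job is essentially one process that needs about 2 GB, but with ppn
# read as cores the per-node figure becomes 16 * 2048 MB.
total_mb = per_node_total(ppn=16, per_process_mb=2048)
print(total_mb)  # 32768
```

So a user who picks ppn=16 only to get 16 cores for a single multi-threaded
process ends up asking for 32 GB on the node rather than 2 GB.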

I personally don't think that per-process resources make much sense these days
(since the number of processes isn't limited by Torque anyway). That includes
per-process resources.

But again either way is OK for me, I just think we should define which way it
works.
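
One place where ppn is directly visible from inside a job is $PBS_NODEFILE,
where (per comment #5) each node name appears ppn times. The sketch below
counts those repetitions. The helper name and the sample host names are
invented for illustration, and it assumes the file holds one host name per
line.

```python
import os
from collections import Counter


def slots_per_host(lines):
    """Count how many times each host name appears in nodefile lines."""
    return Counter(line.strip() for line in lines if line.strip())


# Inside a job, PBS_NODEFILE points at the node file. Outside a job it is
# unset, so fall back to made-up sample content (2 nodes, ppn=2).
path = os.environ.get("PBS_NODEFILE")
if path and os.path.exists(path):
    with open(path) as fh:
        counts = slots_per_host(fh)
else:
    counts = slots_per_host(["node01\n", "node01\n", "node02\n", "node02\n"])

for host, n in sorted(counts.items()):
    print(host, n)
```

Whatever the count is called (processes, cores, or virtual processors), this
is the number an MPI launcher would normally pick up; nothing here limits how
many processes the job actually starts.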
