[torqueusers] Calculating the number of CPUs a job is using

David Singleton David.Singleton at anu.edu.au
Sun May 25 15:52:18 MDT 2008


A word of warning - cputime and %CPU usage often have no bearing
on "efficiency", particularly "parallel efficiency".  A hung MPI
job (or threaded application) can use 100% CPU while doing absolutely
nothing.  MPI (and threading) libraries can use "spin-waiting" or
"busy-waiting" in blocking calls to get the best performance,
particularly with "userspace" communication libraries.  An
application that spends a lot of time busy-waiting will use a
lot of cputime but won't be very efficient.
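
As a crude illustration (a toy loop only, nothing to do with how any
real MPI library is implemented), a process that spin-waits on a flag
looks 100% busy to the OS even though it achieves nothing:

    import time

    def busy_wait(ready, timeout=5.0):
        # Poll a flag in a tight loop: cputime accrues at the same
        # rate as walltime even though no useful work is being done.
        deadline = time.time() + timeout
        while not ready() and time.time() < deadline:
            pass    # no sleep, no yield - one core stays fully busy

    busy_wait(lambda: False)   # ~5s of cputime, zero useful work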

Parallel efficiency is usually defined in terms of the walltimes of
jobs of various sizes, i.e. it's all relative:
http://nf.apac.edu.au/training/MPIAppOpt/slides/slides.009.html
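
For an N-process job the usual definition is something like

    E(N) = T(1) / (N * T(N))

where T(N) is the walltime of the N-process run.  For example, a job
that takes 100s on 1 CPU and 30s on 4 CPUs has E(4) = 100/(4*30),
roughly 0.83.  Note that this needs runs at (at least) two different
sizes - it is not something you can read off a single job's
accounting record.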

From outside the application (e.g. just looking at PBS stats) it
can be very difficult to come up with any measure of the efficiency
of a single job.

David

Paul Millar wrote:
> Hi,
> 
> I'm trying to calculate a mean-efficiency metric for running jobs and have a
> few questions (apologies if they're well-documented: I did look, but couldn't
> find the answers).
> 
> To evaluate a job's efficiency, I want to calculate:
> 	CPU-time / (nCPUs * wallclock-time)
> 
> where CPU-time is the aggregated total CPU-time of all processes associated
> with the job, wallclock-time is the total time the job has been in
> state "Running", and nCPUs is the number of CPUs allocated to the job.
> 
> (For "allocated" I mean that, by running the job, those CPUs may not be 
> allocated by another job.  This is more about how torque does its 
> book-keeping rather than how pbs_mom is implemented; whether the job's 
> processes are locked to only run on those processors that have been allocated 
> and whether the job actually uses those CPUs are separate questions.)
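> 
> In code, the calculation I have in mind is roughly the following (a sketch
> only; cput and walltime are meant to be the HH:MM:SS values that qstat -f
> reports as resources_used.cput and resources_used.walltime, I believe, and
> ncpus is the allocated-CPU count I'm trying to work out below):
> 
>     def hms_to_seconds(hms):
>         # Convert an HH:MM:SS string to a number of seconds.
>         h, m, s = (int(x) for x in hms.split(":"))
>         return h * 3600 + m * 60 + s
> 
>     def efficiency(cput, walltime, ncpus):
>         # CPU-time / (nCPUs * wallclock-time), as defined above.
>         return float(hms_to_seconds(cput)) / (ncpus * hms_to_seconds(walltime))
> 
>     # e.g. a 4-CPU job that has used 10h of CPU over 3h of walltime:
>     # efficiency("10:00:00", "03:00:00", 4) is roughly 0.83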
> 
> Torque keeps track of a job's CPU- and wallclock-time (with the above
> semantics, I believe), but unfortunately it doesn't seem to provide the number
> of processors that have been allocated (nCPUs, above).
> 
> So, I'm trying to understand how to calculate nCPUs from the numbers provided
> by qstat -f (a rough sketch of what I have in mind follows the resource
> descriptions below).
> 
> I've looked at various sources of information (pbs_resource(7), admin-guide,
> etc.).  Each gives an almost complete picture, but always seems to be missing
> some part.
> 
> There are three resources that may describe the number of CPUs a job has been
> allocated: nodes, nodect and ncpus.
> 
> 	nodes (string) describes what the user *wants* for their job.  The simplest
> form is an integer (e.g., "5"), which says how many nodes the job should run
> on.  The nodes resource may also take a more complex format that requests
> various numbers of nodes of different types.
> 
> 	nodect (integer, read-only) is the number of nodes on which the job is running.
> This number is calculated by the server and cannot be specified by the user when
> submitting a job.
> 
> 	ncpus (integer) is the default number of CPUs per node requested by the user.
> It only has an effect on those parts of the "nodes" resource request that do not
> include an explicit "ppn" statement.  If not specified, a value of 1 is assumed.
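> 
> To make the intent concrete, here is roughly how I imagine nCPUs would be
> derived from the nodes and ncpus values - a sketch only, and it assumes my
> descriptions above are right, which is part of what I'm asking:
> 
>     def ncpus_allocated(nodes_spec, default_ncpus=1):
>         # Sum CPUs over a nodes spec such as "2:ppn=4+fast:ppn=2".
>         # Parts without an explicit ppn fall back to default_ncpus
>         # (the ncpus resource, or 1 if it isn't set).  Only node
>         # counts, node names and ppn are handled, nothing else.
>         total = 0
>         for part in nodes_spec.split("+"):
>             props = part.split(":")
>             count = int(props[0]) if props[0].isdigit() else 1  # named node
>             ppn = default_ncpus
>             for prop in props[1:]:
>                 if prop.startswith("ppn="):
>                     ppn = int(prop[len("ppn="):])
>             total += count * ppn
>         return total
> 
>     # e.g. ncpus_allocated("2:ppn=4+fast:ppn=2") gives 2*4 + 1*2 = 10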
> 
> Some questions:
> 
> 	Is the description of nodes, nodect and ncpus above right?
> 
> 	Is the nodect resource *always* present in "qstat -f" output for running jobs?
> 
> 	Is ncpus *only* present if the user explicitly specifies the value (e.g., 
> when submitting the job)?
> 
> 	A user submits two jobs with "-l nodes=1:ppn=1".  If the cluster contains
> only machines with dual processors, will the two jobs always run on different
> nodes, or does torque allow them to run concurrently on the same node?
> 
> 	If a user submits two jobs, one with "-l nodes=1,ncpus=2" and the other 
> with "-l nodes=1:ppn=2", are the two requests always treated identically?  
> Can the system- and queue- configuration result in different behaviour for 
> the two jobs?
> 
> 	A user submits a single job with just "-l nodes=2".  Is a value of "ppn=1" 
> and "ncpus=1" *always* assumed, or can the system- or queue- configuration 
> affect this?
> 
> 	A user specifies a single job with "-l ncpus=2,nodes=3+5:ppn=7" on a cluster
> containing only 11-CPU nodes.  How many CPUs is the job allocated (i.e., how
> many CPUs won't be assigned concurrently to another job)?
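> 
> (Applying my reading above mechanically - ncpus=2 acting as the default ppn
> for the bare "3" part - the sketch earlier would give 3*2 + 5*7 = 41 CPUs,
> but I have no idea whether that is what torque actually does.)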
> 
> Sorry for all the questions.
> 
> Cheers,
> 
> Paul.

