[torqueusers] Calculating the number of CPUs a job is using

Paul Millar p.millar at physics.gla.ac.uk
Sun May 25 08:44:01 MDT 2008


Hi,

I'm trying to calculate a mean-efficiency metric for running jobs and have a 
few questions (apologies if they're well-documented: I did look, but couldn't 
find the answers).

To evaluate a job's efficiency, I want to calculate:
	CPU-time / (nCPUs * wallclock-time)

Where CPU-time is the aggregated total CPU-time for all processes associated 
with the job, wallclock-time is the total time the job has been in 
state "Running", and nCPUs is the number of CPUs allocated to the job.

(For "allocated" I mean that, by running the job, those CPUs may not be 
allocated by another job.  This is more about how torque does its 
book-keeping rather than how pbs_mom is implemented; whether the job's 
processes are locked to only run on those processors that have been allocated 
and whether the job actually uses those CPUs are separate questions.)
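To make that concrete, here's a minimal sketch of the metric in Python (the 
numbers are made up purely for illustration):

	def efficiency(cpu_seconds, wall_seconds, ncpus):
	    # CPU-time / (nCPUs * wallclock-time), as defined above
	    return cpu_seconds / float(ncpus * wall_seconds)

	# A job that accumulated 10h of CPU time over 6h of wallclock
	# on 2 allocated CPUs ran at roughly 83% efficiency:
	print(efficiency(10 * 3600, 6 * 3600, 2))   # 0.8333...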

Torque keeps track of a job's CPU- and wallclock- time (with the above 
semantics, I believe) but unfortunately it doesn't seem to provide the number 
of processors that have been allocated (nCPUs, above).

So, I'm trying to understand how to calculate nCPUs from the numbers provided 
by qstat -f.
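For completeness, here's roughly how I'm extracting the two quantities torque 
does provide (a minimal sketch in Python; the job id is made up, and I'm only 
relying on the "name = value" layout of qstat -f output, where long values 
wrap onto tab-indented continuation lines):

	import subprocess

	def job_attributes(jobid):
	    # Flatten "qstat -f" output into a dict of name/value pairs,
	    # re-joining tab-indented continuation lines first.
	    out = subprocess.check_output(["qstat", "-f", jobid], text=True)
	    out = out.replace("\n\t", "")
	    attrs = {}
	    for line in out.splitlines():
	        if " = " in line:
	            name, value = line.split(" = ", 1)
	            attrs[name.strip()] = value.strip()
	    return attrs

	def seconds(hhmmss):
	    # Convert torque's "HH:MM:SS" strings to seconds.
	    h, m, s = (int(x) for x in hhmmss.split(":"))
	    return 3600 * h + 60 * m + s

	attrs = job_attributes("1234.example.org")   # hypothetical job id
	cpu = seconds(attrs["resources_used.cput"])
	wall = seconds(attrs["resources_used.walltime"])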

I've looked at various sources of information (pbs_resource(7), admin-guide, 
etc).  Each paints an almost complete picture, but each seems to be missing 
some part.

There are three resources that may describe the number of CPUs a job has been 
allocated: nodes, nodect and ncpus.  (A sketch of how I'd combine them 
follows the list.)

	nodes (string) describes what the user *wants* for their job.  The simplest 
form is an integer (e.g., "5"), which says how many nodes the job should run 
on.  The nodes resource may also take a more complex format, requesting 
various numbers of nodes of different types.

	nodect (integer, read-only) is the number of nodes on which the job is 
running.  This number is calculated by the server and cannot be specified by 
the user when submitting a job.

	ncpus (integer) is the default number of CPUs per node requested by the 
user.  It only has an effect on those parts of the "nodes" resource request 
that do not include an explicit "ppn" statement.  If not specified, a value 
of 1 is assumed.

Some questions:

	Are the descriptions of nodes, nodect and ncpus above right?

	Is the nodect resource *always* present in "qstat -f" output for running 
jobs?

	Is ncpus *only* present if the user explicitly specifies the value (e.g., 
when submitting the job)?

	A user submits two jobs with "-l nodes=1:ppn=1".  If the cluster contains 
only dual-processor machines, will the jobs always run on two different 
nodes, or does torque allow the two jobs to run concurrently on the same node?

	If a user submits two jobs, one with "-l nodes=1,ncpus=2" and the other 
with "-l nodes=1:ppn=2", are the two requests always treated identically?  
Can the system and queue configuration result in different behaviour for the 
two jobs?

	A user submits a single job with just "-l nodes=2".  Are values of "ppn=1" 
and "ncpus=1" *always* assumed, or can the system or queue configuration 
affect this?

	A user specifies a single job with "-l ncpus=2,nodes=3+5:ppn=7" on a cluster 
containing only 11-CPU nodes.  How many CPUs are allocated to the job (i.e., 
CPUs that won't be assigned concurrently to another job)?

Sorry for all the questions.

Cheers,

Paul.

