[torqueusers] Calculating the number of CPUs a job is using
p.millar at physics.gla.ac.uk
Sun May 25 08:44:01 MDT 2008
I'm trying to calculate a mean-efficiency metric for running jobs and have a
few questions (apologies if they're well-documented: I did look, but couldn't
find the answers).
To evaluate a job's efficiency, I want to calculate:
CPU-time / (nCPUs * wallclock-time)
Where CPU-time is the aggregated total CPU-time for all processes associated
with the job, wallclock-time is the total time the job has been in
state "Running", and nCPUs is the number of CPUs allocated for the job.
(By "allocated" I mean that, while the job is running, those CPUs may not be
allocated to another job. This is more about how torque does its
book-keeping rather than how pbs_mom is implemented; whether the job's
processes are locked to only run on those processors that have been allocated
and whether the job actually uses those CPUs are separate questions.)
Torque keeps track of a job's CPU- and wallclock- time (with the above
semantics, I believe) but unfortunately it doesn't seem to provide the number
of processors that have been allocated (nCPUs, above).
So, I'm trying to understand how to calculate nCPUs from the numbers provided
by qstat -f.
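For what it's worth, the CPU-time and wallclock-time halves of the metric are easy to compute from the resources_used.cput and resources_used.walltime fields that qstat -f reports; only nCPUs is missing. A rough sketch (nCPUs supplied by hand, since that's exactly the number I can't obtain):

```python
# Sketch: compute the efficiency metric from the HH:MM:SS strings that
# "qstat -f" prints for resources_used.cput and resources_used.walltime.
# nCPUs must be supplied separately -- torque doesn't report it directly.

def hms_to_seconds(hms):
    """Convert an [H]H:MM:SS string, as printed by qstat, to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

def efficiency(cput, walltime, ncpus):
    """CPU-time / (nCPUs * wallclock-time), per the formula above."""
    return hms_to_seconds(cput) / (ncpus * hms_to_seconds(walltime))

# e.g. a 4-CPU job that accumulated 10h of CPU time over 3h of wallclock:
print(round(efficiency("10:00:00", "03:00:00", 4), 3))  # 0.833
```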
I've looked at various sources of information (pbs_resource(7), admin-guide,
etc). Each contains an almost complete picture, but always seems to be
missing some part.
There are three resources that may describe the number of CPUs a job has been
allocated: nodes, nodect and ncpus.
nodes (string) describes what the user *wants* for their job. The simplest
form is an integer (e.g., "5"), which says how many nodes the job should run
on. The nodes resource may also take a more complex format that requests
varying numbers of nodes of different types.
nodect (integer, read-only) is the number of nodes on which the job is
running. This number is calculated by the server and cannot be specified by
the user when submitting a job.
ncpus (integer) is the default number of CPUs per node requested by the user.
It only has an effect on those parts of the "nodes" resource request that do
not include an explicit "ppn" statement. If not specified, a value of 1 is
assumed.
Is the description of nodes, nodect and ncpus above right?
Is nodect resource *always* present in "qstat -f" output for running jobs?
Is ncpus *only* present if the user explicitly specifies the value (e.g.,
when submitting the job)?
A user submits two jobs with "-l nodes=1:ppn=1". If the cluster contains
only machines with dual processors, will the two jobs always run on different
nodes, or does torque allow them to run concurrently on the same node?
If a user submits two jobs, one with "-l nodes=1,ncpus=2" and the other
with "-l nodes=1:ppn=2", are the two requests always treated identically?
Can the system- and queue- configuration result in different behaviour for
the two jobs?
A user submits a single job with just "-l nodes=2". Is a value of "ppn=1"
and "ncpus=1" *always* assumed, or can the system- or queue-configuration
override these defaults?
A user specifies a single job with "-l ncpus=2,nodes=3+5:ppn=7" on a cluster
containing only 11-CPU nodes. How many CPUs is the job allocated (i.e.,
won't be assigned concurrently to another job)?
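If my reading of the semantics above is right (and that's exactly what I'm asking), nCPUs could be computed from the nodes spec by treating each "+"-separated part as count * ppn, with ncpus supplying the default ppn for parts that lack an explicit "ppn=" statement. A sketch, under that assumption:

```python
# Sketch of nCPUs under the semantics described above (my reading, not
# gospel): each "+"-separated part of a nodes spec contributes
# count * ppn CPUs, where ncpus is the default ppn for parts without an
# explicit "ppn=" statement. Node properties (e.g. ":fast") are ignored.

def ncpus_allocated(nodes_spec, ncpus_default=1):
    total = 0
    for part in nodes_spec.split("+"):
        fields = part.split(":")
        # A leading integer is a node count; anything else is a
        # hostname or property, i.e. a request for a single node.
        count = int(fields[0]) if fields[0].isdigit() else 1
        ppn = ncpus_default
        for prop in fields[1:]:
            if prop.startswith("ppn="):
                ppn = int(prop[len("ppn="):])
        total += count * ppn
    return total

# The example above: -l ncpus=2,nodes=3+5:ppn=7
print(ncpus_allocated("3+5:ppn=7", ncpus_default=2))  # 41, i.e. 3*2 + 5*7
```

That would give 41 CPUs for the example, but whether that matches what the server actually reserves is the question.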
Sorry for all the questions.