[torqueusers] Calculating the number of CPUs a job is using
David.Singleton at anu.edu.au
Sun May 25 15:52:18 MDT 2008
A word of warning: cputime and %CPU usage often have no bearing
on "efficiency", particularly "parallel efficiency". A hung MPI
job (or threaded application) can use 100% CPU doing absolutely
nothing. MPI (and threading) libraries can use "spin-waiting" or
"busy-waiting" in blocking calls to get the best performance,
particularly with "userspace" communication libraries. An
application spending a lot of time busy-waiting will be using a
lot of cputime but won't be very efficient.
Parallel efficiency is usually defined in terms of the walltimes of
various-sized runs of the same problem, i.e. it's all relative:

    efficiency(N) = walltime(1 CPU) / (N * walltime(N CPUs))
From outside the application (eg. just looking at PBS stats) it
can be very difficult to come up with any measure of the efficiency
of a single job.
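To make the relative notion concrete, here is a minimal sketch of the walltime-based definition above; the timings are illustrative, not measured:

```python
# Parallel efficiency in the relative sense described above: compare the
# walltime of an N-CPU run against a 1-CPU baseline of the same problem.

def parallel_efficiency(walltime_1, walltime_n, n):
    """Speedup divided by CPU count: T(1) / (N * T(N))."""
    return walltime_1 / (n * walltime_n)

# Example: a job that takes 1000 s on 1 CPU and 300 s on 4 CPUs.
eff = parallel_efficiency(1000.0, 300.0, 4)
print(f"parallel efficiency: {eff:.2f}")  # 1000 / (4 * 300) ~ 0.83
```

Note that a busy-waiting job can report near-100% CPU usage and still score poorly on this measure, which is exactly the point: the walltime comparison needs runs at more than one size, which is why it is hard to get from the stats of a single job.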
Paul Millar wrote:
> I'm trying to calculate a mean-efficiency metric for running jobs and have a
> few questions (apologies if they're well-documented: I did look, but couldn't
> find the answers).
> To evaluate a job's efficiency, I want to calculate:
> CPU-time / (nCPUs * wallclock-time)
> Where CPU-time is the aggregated total CPU-time of all processes associated
> with the job, wallclock-time is the total time the job has been in
> state "Running", and nCPUs is the number of CPUs allocated to the job.
> (By "allocated" I mean that, while the job is running, those CPUs may not be
> allocated to another job. This is more about how Torque does its
> book-keeping than about how pbs_mom is implemented; whether the job's
> processes are locked to run only on the processors that have been allocated,
> and whether the job actually uses those CPUs, are separate questions.)
> Torque keeps track of a job's CPU- and wallclock- time (with the above
> semantics, I believe) but unfortunately it doesn't seem to provide the number
> of processors that have been allocated (nCPUs, above).
> So, I'm trying to understand how to calculate nCPUs from the numbers provided
> by qstat -f.
> I've looked at various sources of information (pbs_resource(7), admin-guide,
> etc). They mostly contain an almost complete picture, but they always seem
> to be missing some part.
> There's three resources that may describe the number of CPUs a job has been
> allocated: nodes, nodect and ncpus.
> nodes (string) describes what the user *wants* for their job. The simplest
> form is an integer (e.g., "5"), which says how many nodes the job should run
> on. The nodes resource may also take a more complex format that requests
> various numbers of nodes of different types.
> nodect (integer, read-only) is the number of nodes on which the job is
> running. This number is calculated by the server and cannot be specified by
> the user when submitting a job.
> ncpus (integer) is the default number of CPUs per node requested by the user.
> It only has an effect on those parts of the "nodes" resource request that do
> not include an explicit "ppn" statement. If not specified, a value of 1 is
> assumed.
> Some questions:
> Is the description of nodes, nodect and ncpus above right?
> Is nodect resource *always* present in "qstat -f" output for running jobs?
> Is ncpus *only* present if the user explicitly specifies the value (e.g.,
> when submitting the job)?
> A user submits two jobs with "-l nodes=1:ppn=1". If the cluster contains
> only machines with dual processors, will the two jobs always run on different
> nodes, or does Torque allow them to run concurrently on the same node?
> If a user submits two jobs, one with "-l nodes=1,ncpus=2" and the other
> with "-l nodes=1:ppn=2", are the two requests always treated identically?
> Can the system- and queue- configuration result in different behaviour for
> the two jobs?
> A user submits a single job with just "-l nodes=2". Is a value of "ppn=1"
> and "ncpus=1" *always* assumed, or can the system- or queue- configuration
> affect this?
> A user specifies a single job with "-l ncpus=2,nodes=3+5:ppn=7" on a cluster
> containing only 11-CPU nodes. How many CPUs is the job allocated (i.e.,
> won't be assigned concurrently to another job)?
> Sorry for all the questions.
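Under the semantics Paul describes (each "+"-separated clause of the nodes request is a node count with an optional ":ppn=<p>" suffix, and clauses without an explicit ppn fall back to the ncpus default of 1), the arithmetic in his last example can be sketched as follows. This follows the description in the mail, not the Torque source, so treat it as an illustration rather than a statement of Torque's actual behaviour:

```python
# Sketch: compute nCPUs from a Torque "nodes" request string, assuming the
# semantics described in the question above (not verified against Torque).

def ncpus_from_request(nodes_spec, ncpus_default=1):
    total = 0
    for clause in nodes_spec.split("+"):
        parts = clause.split(":")
        # A leading integer is a node count; anything else (a node name) is
        # taken as a single node.
        count = int(parts[0]) if parts[0].isdigit() else 1
        ppn = ncpus_default  # used when no explicit ppn= appears
        for prop in parts[1:]:
            if prop.startswith("ppn="):
                ppn = int(prop[len("ppn="):])
        total += count * ppn
    return total

# Paul's last example: -l ncpus=2,nodes=3+5:ppn=7
# 3 nodes * 2 CPUs (ncpus default) + 5 nodes * 7 CPUs (explicit ppn) = 41
print(ncpus_from_request("3+5:ppn=7", ncpus_default=2))  # 41
```

Whether Torque actually reserves 41 CPUs here (rather than, say, whole 11-CPU nodes) is exactly the question being asked, so the sketch only formalises the requested arithmetic.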