[torqueusers] Calculating the number of CPUs a job is using

Paul Millar p.millar at physics.gla.ac.uk
Mon May 26 00:42:36 MDT 2008


Hi David,

Thanks for the information, comments below:

On Sunday 25 May 2008 23:52:18 David Singleton wrote:
> A word of warning - cputime and %CPU usage often have no bearing
> on "efficiency", particularly "parallel efficiency".

Yes, that makes sense, and it's something worth mentioning in the docs.

I guess that, in general, dividing CPU time by wallclock time gives only an
upper bound on the efficiency, since from outside the job it's impossible to
tell whether the processor is being used "profitably" or is merely caught in
an unintended tight loop or, as you say, spinning in some busy-waiting event
loop.
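
As a rough sketch of that ratio (in Python; the function names are mine, and
I'm assuming the two times arrive as HH:MM:SS strings, the way qstat -f shows
resources_used.cput and resources_used.walltime, plus a known count of
allocated CPUs):

```python
def hms_to_seconds(hms):
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_utilisation(cput, walltime, ncpus=1):
    """CPU time divided by (wallclock time * allocated CPUs).

    Returns None when the wallclock time is zero, since the ratio
    is undefined for a job that has not accumulated any walltime.
    """
    wall = hms_to_seconds(walltime)
    if wall == 0:
        return None
    return hms_to_seconds(cput) / (wall * ncpus)


# Example: 3h of CPU time over 1h of wallclock on 4 CPUs.
print(cpu_utilisation("03:00:00", "01:00:00", ncpus=4))  # 0.75
```

A value near 1 only says the CPUs were busy, not that they were doing
anything useful, which is why I'd treat it as an upper bound.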

> A hung MPI job (or threaded application) can use 100% CPU doing
> absolutely nothing.

Whilst that's true, it's also true for non-MPI jobs.  Perhaps it's more likely
with MPI jobs, due to how the user-space libraries are written: I believe
non-MPI jobs tend to use blocking IO (or kernel-based event loops for more
advanced applications) rather than busy-waiting.

> MPI (and threads) libraries can use "spin-waiting" or 
> "busy-waiting" in blocking calls to get best performance
> particularly with "userspace" communication libraries.

Err, OK.  Fair enough, I guess, and worth noting.

> Parallel efficiency is usually defined in terms of walltimes of
> various size jobs, i.e. it's all relative:
> http://nf.apac.edu.au/training/MPIAppOpt/slides/slides.009.html

Interesting; if I understood correctly, this is a measure of efficiency based
on an idealised model in which running a job on N nodes should take 1/N times
as long as running it on one.

Whilst it looks like a much better estimate of efficiency, I guess one has to
run the job on a single processor to get the reference wallclock time.
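
To make that concrete, here is a small sketch using the usual textbook
definition, efficiency = T(1) / (N * T(N)), which is my reading of the idea
and not necessarily the exact formula on the slide. The function and
parameter names are made up for illustration:

```python
def parallel_efficiency(t1, tn, n):
    """Speedup on n processors divided by n, i.e. t1 / (n * tn).

    t1: wallclock time of the single-processor reference run
    tn: wallclock time of the same job on n processors
    """
    if n < 1 or tn <= 0:
        raise ValueError("need n >= 1 and tn > 0")
    return t1 / (n * tn)


# A job taking 1000 s on one processor and 150 s on eight
# comes out at roughly 0.83, i.e. short of the ideal 1.0.
print(parallel_efficiency(1000.0, 150.0, 8))
```

The ideal case (T(N) exactly T(1)/N) gives 1.0; note that only wallclock
times appear, so busy-waiting doesn't inflate the figure.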


> From outside the application (eg. just looking at PBS stats) it
> can be very difficult to come up with any measure of the efficiency
> of a single job.

True, in general (and especially for MPI jobs); but, assuming a job doesn't
use a busy-waiting loop and isn't caught in a tight loop, the ratio of CPU
time to wallclock time should give a reasonable estimate, no?  The measure can
then be used to discover at least some grossly misbehaving jobs.
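
Something along these lines is what I have in mind for the screening; it's
only a sketch, the 0.5 threshold is an arbitrary number I picked for
illustration, and the job records are invented:

```python
def suspicious_jobs(jobs, threshold=0.5):
    """Return ids of jobs whose CPU/wallclock ratio per CPU is below threshold.

    jobs: iterable of (job_id, cpu_seconds, wall_seconds, ncpus) tuples.
    Jobs with no walltime or no CPUs recorded are skipped.
    """
    flagged = []
    for job_id, cpu_s, wall_s, ncpus in jobs:
        if wall_s <= 0 or ncpus <= 0:
            continue
        if cpu_s / (wall_s * ncpus) < threshold:
            flagged.append(job_id)
    return flagged


jobs = [
    ("101", 3500, 3600, 1),  # busy for nearly the whole hour
    ("102", 400, 3600, 4),   # four CPUs allocated, almost idle
    ("103", 7000, 3600, 2),  # both CPUs busy
]
print(suspicious_jobs(jobs))  # ['102']
```

This catches only the idle kind of misbehaviour; a hung job spinning at 100%
CPU would sail straight through, as you point out.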

Cheers,

Paul.
