[torqueusers] Calculating the number of CPUs a job is using
Paul Millar
p.millar at physics.gla.ac.uk
Mon May 26 00:42:36 MDT 2008
Hi David,
Thanks for the information, comments below:
On Sunday 25 May 2008 23:52:18 David Singleton wrote:
> A word of warning - cputime and %CPU usage often have no bearing
> on "efficiency", particularly "parallel efficiency".
Yes, that makes sense, and something that's worth mentioning in the docs.
I guess (in general) dividing CPU-time by wallclock-time gives only an upper
bound on the efficiency, as it's impossible to tell from outside whether the
processor is being used "profitably" or is merely caught in an unintended
tight loop or, as you say, spinning in some busy-waiting event loop.
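
As a rough illustration of that ratio, here is a minimal Python sketch. It
assumes the times are available as HH:MM:SS strings, which is how TORQUE
reports resources_used.cput and resources_used.walltime; the function names
and the ncpus parameter are made up for this example. The result is only an
upper bound on useful work, since busy-waiting also counts as CPU time.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_utilisation(cput: str, walltime: str, ncpus: int = 1) -> float:
    """Return CPU time divided by (wallclock time * number of CPUs).

    ncpus is an assumption of this sketch: the caller has to supply
    the number of CPUs allocated to the job.
    """
    wall = hms_to_seconds(walltime)
    if wall <= 0 or ncpus <= 0:
        raise ValueError("walltime and ncpus must be positive")
    return hms_to_seconds(cput) / (wall * ncpus)


if __name__ == "__main__":
    # A single-CPU job that accumulated 1h30m of CPU time in 2h of walltime.
    print(cpu_utilisation("01:30:00", "02:00:00"))
```

A value near 1.0 says the CPUs were busy, not that they were doing anything
useful.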
> A hung MPI job (or threaded application) can use 100% CPU doing
> absolutely nothing.
Whilst that's true, it's also true for non-MPI jobs. Perhaps it's more likely
with MPI jobs, due to how the user-space libraries are written: I believe
non-MPI jobs tend to use blocking IO (or kernel-based event loops for more
advanced applications) rather than busy-waiting.
> MPI (and threads) libraries can use "spin-waiting" or
> "busy-waiting" in blocking calls to get best performance
> particularly with "userspace" communication libraries.
Err, OK. Fair enough, I guess, and worth noting.
> Parallel efficiency is usually defined in terms of walltimes of
> various size jobs, i.e. it's all relative:
> http://nf.apac.edu.au/training/MPIAppOpt/slides/slides.009.html
Interesting; if I understood correctly, this is a measure of efficiency based
on an idealised model in which running a job on N nodes should take 1/N times
as long as running it on one.
Whilst it looks like a much better estimate of efficiency, I guess one has to
run the job on a single processor to get the reference wallclock-time.
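
For concreteness, that walltime-based definition can be sketched in Python as
below. This is my reading of the usual textbook definition (speedup is T(1)
divided by T(N), efficiency is speedup divided by N), not something taken
from the slides; the function and parameter names are invented for the
example.

```python
def speedup(t_single: float, t_parallel: float) -> float:
    """Speedup S(N) = T(1) / T(N), using wallclock times in seconds."""
    if t_single <= 0 or t_parallel <= 0:
        raise ValueError("wallclock times must be positive")
    return t_single / t_parallel


def parallel_efficiency(t_single: float, t_parallel: float, n: int) -> float:
    """Efficiency E(N) = S(N) / N; 1.0 means ideal 1/N scaling."""
    if n <= 0:
        raise ValueError("n must be positive")
    return speedup(t_single, t_parallel) / n


if __name__ == "__main__":
    # Reference run: 100 s on one processor; same job: 30 s on four.
    print(parallel_efficiency(100.0, 30.0, 4))
```

Note that t_single has to come from a separate reference run, which is
exactly the cost mentioned above.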
> From outside the application (eg. just looking at PBS stats) it
> can be very difficult to come up with any measure of the efficiency
> of a single job.
True, in general (and especially for MPI jobs); but, assuming a job doesn't
use a busy-waiting loop and isn't caught in a tight loop, the ratio of
CPU-time to wallclock-time should give a reasonable estimate, no? The measure
can then be used to discover at least some grossly misbehaving jobs.
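
Something along these lines is what I have in mind for spotting such jobs: a
small Python sketch that flags jobs whose ratio falls below a cut-off. The
record layout (job id, CPU seconds, wallclock seconds, CPU count) and the
default threshold of 0.1 are purely hypothetical. Only low ratios are
flagged, because a high ratio can still be a job spinning uselessly.

```python
from typing import Iterable, List, Tuple

# Hypothetical record: (job_id, cpu_seconds, wall_seconds, ncpus)
JobRecord = Tuple[str, float, float, int]


def low_utilisation_jobs(jobs: Iterable[JobRecord],
                         threshold: float = 0.1) -> List[str]:
    """Return ids of jobs whose CPU/wallclock ratio is below threshold."""
    flagged = []
    for job_id, cpu_s, wall_s, ncpus in jobs:
        if wall_s <= 0 or ncpus <= 0:
            # No meaningful ratio can be computed; skip the record.
            continue
        if cpu_s / (wall_s * ncpus) < threshold:
            flagged.append(job_id)
    return flagged


if __name__ == "__main__":
    example = [
        ("job-a", 3500.0, 3600.0, 1),  # busy for most of its walltime
        ("job-b", 60.0, 3600.0, 1),    # almost entirely idle
    ]
    print(low_utilisation_jobs(example))
```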
Cheers,
Paul.