[torqueusers] elapsed time and time use?
Troy Baer
troy at osc.edu
Thu Dec 15 16:03:04 MST 2005
On Thu, 2005-12-15 at 15:43 -0500, Aquarijen wrote:
> I think I may be missing something. I have seen references to 1.2.0p5
> having elapsed time issues, so I upgraded torque to 2.0.0p2, thinking
> that I might then see something other than 00:00 for elapsed time or
> 00:00:00 for time use. I did this on my test cluster and then
> launched a 20 node job that runs for several hours. It has been
> running for around 6 hours now and I still do not get a > 0 value for
> either column. Is there a parameter I have not set correctly? I'm at
> a loss. If I have missed something in the documentation, please let
> me know. Otherwise, an explanation for my users (ok, ok, ok, for me
> too!) as to why this is the case, would be helpful and very much
> appreciated.
I'm guessing that this is an MPI job using MPICH or MPICH2? If so, a
"qstat -f jobid | grep used" will most likely show that the job has
accumulated lots of wallclock time but no CPU time.
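
If you want to check for this across many jobs, the same test can be
scripted. What follows is only a rough sketch: it assumes the
"qstat -f" output carries resources_used.cput and
resources_used.walltime lines in HH:MM:SS form. The helper names, the
60-second threshold, and the sample text are made up for illustration
and are not output from a real job.

```python
import re
import subprocess


def run_qstat(job_id):
    """Return the text of `qstat -f <job_id>` (needs qstat on PATH)."""
    result = subprocess.run(["qstat", "-f", job_id],
                            capture_output=True, text=True, check=True)
    return result.stdout


def parse_resources_used(qstat_text):
    """Map each resources_used.<name> line to its value string."""
    pattern = re.compile(
        r"^[ \t]*resources_used\.(\w+)[ \t]*=[ \t]*(\S+)[ \t]*$",
        re.MULTILINE)
    return dict(pattern.findall(qstat_text))


def hms_to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def looks_untracked(used, min_wall_seconds=60):
    """True when the job has real wallclock time but zero CPU time.

    The threshold is an arbitrary choice for this sketch.
    """
    wall = hms_to_seconds(used.get("walltime", "00:00:00"))
    cpu = hms_to_seconds(used.get("cput", "00:00:00"))
    return wall >= min_wall_seconds and cpu == 0


# Illustrative sample only; the values are invented.
SAMPLE = """\
Job Id: 123.example
    job_state = R
    resources_used.cput = 00:00:00
    resources_used.mem = 2936kb
    resources_used.walltime = 06:02:11
"""

if __name__ == "__main__":
    used = parse_resources_used(SAMPLE)
    print(used, looks_untracked(used))
```

On a real system you would pass run_qstat("jobid") to
parse_resources_used() instead of SAMPLE.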
PBS can only track the CPU time utilization of processes that its
pbs_mom daemon starts. Processes started by other mechanisms such as
rsh (which is used by MPICH's mpirun) cannot be tracked, because the
pbs_mom daemon doesn't have a parent-child relationship with them.
There is a solution to this, which is to replace the MPICH mpirun with
the mpiexec program developed here at OSC:
http://www.osc.edu/~pw/mpiexec/
This uses the task management (TM) interface in PBS to start up the MPI
processes, and supports a number of MPICH and MPICH2 channel drivers.
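
To make the parent-child point concrete, here is a toy model in
Python. It is not how pbs_mom is implemented (the real accounting is
more involved); it only illustrates why a process started through rsh
falls outside what the daemon can see, while one started through TM
does not. All pids, labels, and CPU figures are invented.

```python
# Toy process tables for one compute node.
# Each entry is pid -> (parent pid, CPU seconds used, label).

# MPI rank started by rsh: its ancestor is the rsh daemon, not pbs_mom.
RSH_LAUNCH = {
    1:   (0,   0,     "init"),
    100: (1,   0,     "pbs_mom"),
    300: (1,   0,     "rsh daemon"),
    301: (300, 21000, "MPI rank"),
}

# Same rank started through the TM interface: pbs_mom is its parent.
TM_LAUNCH = {
    1:   (0,   0,     "init"),
    100: (1,   0,     "pbs_mom"),
    301: (100, 21000, "MPI rank"),
}


def descendant_cpu(table, root):
    """Sum the CPU seconds of every descendant of `root` in `table`."""
    # Build a parent -> children index from the flat table.
    children = {}
    for pid, (ppid, _cpu, _label) in table.items():
        children.setdefault(ppid, []).append(pid)

    total = 0
    stack = list(children.get(root, []))
    while stack:
        pid = stack.pop()
        total += table[pid][1]
        stack.extend(children.get(pid, []))
    return total


if __name__ == "__main__":
    print("rsh launch:", descendant_cpu(RSH_LAUNCH, 100))
    print("TM launch: ", descendant_cpu(TM_LAUNCH, 100))
```

In the rsh table the rank's CPU time hangs off a different branch of
the process tree, so nothing is charged to the job. In the TM table it
sits under pbs_mom and gets counted.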
If this is for something other than MPI, well... good luck. You could
always use the mpiexec code as a starting point and reimplement your
package's parallel startup procedure on top of the TM interface.
--Troy
--
Troy Baer troy at osc.edu
Science & Technology Support http://www.osc.edu/hpc/
Ohio Supercomputer Center 614-292-9701