[torqueusers] elapsed time and time use?

Troy Baer troy at osc.edu
Thu Dec 15 16:03:04 MST 2005

On Thu, 2005-12-15 at 15:43 -0500, Aquarijen wrote:
> I think I may be missing something.  I have seen references to 1.2.0p5
> having elapsed time issues, so I upgraded torque to 2.0.0p2, thinking
> that I might then see something other than 00:00 for elapsed time or
> 00:00:00 for time use.  I did this on my test cluster and then
> launched a 20 node job that runs for several hours.  It has been
> running for around 6 hours now and I still do not get a > 0 value for
> either column.  Is there a parameter I have not set correctly?  I'm at
> a loss.  If I have missed something in the documentation, please let
> me know.  Otherwise, an explanation for my users (ok, ok, ok, for me
> too!) as to why this is the case, would be helpful and very much
> appreciated.

I'm guessing that this is an MPI job using MPICH or MPICH2?  If so,
running "qstat -f jobid | grep used" will most likely show that the job
has accumulated lots of wallclock time but no CPU time.
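
If you want to check for this symptom across several jobs, here is a
minimal Python sketch that reads the resources_used lines out of
"qstat -f" output.  The sample text, the job id, and the helper names
(parse_resources_used, hms_to_seconds, looks_untracked) are made up for
illustration; only the "resources_used.<name> = <value>" line layout is
taken from what qstat prints.

```python
import re

# Illustrative `qstat -f` excerpt; job id and values are invented for this sketch.
SAMPLE_QSTAT_F = """\
Job Id: 1234.headnode
    Job_Name = mpi_test
    job_state = R
    resources_used.cput = 00:00:00
    resources_used.mem = 2048kb
    resources_used.walltime = 06:02:17
"""


def parse_resources_used(qstat_text):
    """Return {name: value} for every resources_used.* attribute in the text."""
    pattern = re.compile(
        r"^[ \t]*resources_used\.(\w+)[ \t]*=[ \t]*(.+?)[ \t]*$", re.MULTILINE
    )
    return dict(pattern.findall(qstat_text))


def hms_to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def looks_untracked(used):
    """True when wallclock time has accumulated but CPU time is still zero."""
    cput = hms_to_seconds(used.get("cput", "00:00:00"))
    walltime = hms_to_seconds(used.get("walltime", "00:00:00"))
    return walltime > 0 and cput == 0


used = parse_resources_used(SAMPLE_QSTAT_F)
print(used)
print("wallclock but no CPU time:", looks_untracked(used))
```

On a real system you would feed it the text captured from
"qstat -f jobid" instead of the sample string.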

PBS can only track the CPU time utilization of processes that its
pbs_mom daemon starts.  Processes started by other mechanisms such as
rsh (which is used by MPICH's mpirun) cannot be tracked, because the
pbs_mom daemon doesn't have a parent-child relationship with them.
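
As a rough analogy at the operating-system level (this is not how
pbs_mom itself is implemented), the Python sketch below shows that a
Unix process is only credited with the CPU time of children it started
and waited for.  The helper names are hypothetical.  A process launched
by some other daemon, such as rshd on a remote node, would never show up
in this total.

```python
import resource
import subprocess
import sys


def children_cpu_seconds():
    """CPU seconds (user + system) charged to this process's waited-for children."""
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return usage.ru_utime + usage.ru_stime


def run_cpu_bound_child():
    """Start a child that burns some CPU, then wait for it to exit."""
    busy_loop = "sum(i * i for i in range(2_000_000))"
    subprocess.run([sys.executable, "-c", busy_loop], check=True)


before = children_cpu_seconds()
run_cpu_bound_child()
after = children_cpu_seconds()
print(f"CPU time attributed to children: {after - before:.2f}s")
```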

There is a solution to this, which is to replace the MPICH mpirun with
the mpiexec program developed here at OSC:


This uses the task management (TM) interface in PBS to start up the MPI
processes, and supports a number of MPICH and MPICH2 channel drivers.

If this is for something other than MPI, well...  good luck.  You could
always use the mpiexec code as a starting point and reimplement that
package's parallel startup procedure on top of the TM interface.

Troy Baer                       troy at osc.edu
Science & Technology Support    http://www.osc.edu/hpc/
Ohio Supercomputer Center       614-292-9701