[torquedev] pbsnodes -a shows long gone jobs

Lennart Karlsson Lennart.Karlsson at nsc.liu.se
Thu Jul 2 01:01:09 MDT 2009


On Wed, 17 Jun 2009, Bogdan Costescu wrote:
> On Wed, 17 Jun 2009, Glen Beane wrote:
> 
> > You will notice the state is free, but this lists 5 jobs in the status
> > string.  These jobs are long gone from the system, in the case of job
> > number 37005 the job has been completed for OVER 2 MONTHS!
> 
> I have often seen such jobs kept in the list without actually
> running, in cases where the job died for some reason, e.g. when
> nodes crash; on a ~7-year-old cluster reliability is quite poor, so
> such events happen often. I haven't seen any bad effects from this,
> except maybe some messages from nodes to the master asking for the
> job to be killed (which the server ignores, as the jobs are no
> longer in its database). This is with Torque 2.1.10.
> 
> The nodes are rebuilt upon reboot so whatever state Torque keeps on 
> the local disk is lost, therefore I can't say whether a reboot does 
> something to these phantom jobs...

A bad effect from having these long-gone jobs still remembered by
Torque is that the

    $enablemomrestart 1

setting in the pbs_mom config file stops working, i.e. a

    touch whatever_path/sbin/pbs_mom

command no longer restarts pbs_mom.

Why are the jobs still remembered?

What is the recommended way to get Torque to forget about them?

(I run version 2.3.0-snap.200803221012.)
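
To find out which nodes are affected, something like the following
rough sketch may help. It compares the jobs each node reports in its
status string with a set of job IDs the server still knows about. The
layout of the "pbsnodes -a" text it parses (unindented node name,
indented "key = value" lines, a "jobs=" item inside the comma-separated
status string) is my assumption from what I see here and may differ
between versions. The function name, the host name "master" and job
41230 are made up for illustration; only job 37005 comes from this
thread.

```python
import re


def stale_jobs_by_node(pbsnodes_text, active_job_ids):
    """Map node name -> job IDs the node reports that are not in active_job_ids."""
    stale = {}
    node = None
    for line in pbsnodes_text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Assumed layout: an unindented line starts a new node record.
            node = line.strip()
            continue
        key, sep, value = line.strip().partition(" = ")
        if node is None or not sep or key != "status":
            continue
        # Assumed layout: status is a comma-separated list of key=value
        # items, and the "jobs" item holds space-separated job IDs.
        match = re.search(r"(?:^|,)jobs=([^,]*)", value)
        if match is None:
            continue
        # Skip tokens that do not look like job IDs (they start with a digit).
        reported = [j for j in match.group(1).split() if j[0].isdigit()]
        gone = [j for j in reported if j not in active_job_ids]
        if gone:
            stale[node] = gone
    return stale


# Hypothetical sample input; real output has many more fields.
sample = """node01
     state = free
     np = 4
     status = opsys=linux,state=free,jobs=37005.master 41230.master,rectime=1246518069

node02
     state = free
     np = 4
     status = opsys=linux,state=free,jobs=41230.master,rectime=1246518070
"""

# Job IDs the server still knows about (e.g. collected from qstat).
active = {"41230.master"}
print(stale_jobs_by_node(sample, active))
```

In practice the text would come from running "pbsnodes -a" and the set
of active IDs from qstat; both are left out here so the sketch runs on
its own. It only reports the phantom jobs, it does not remove them.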

-- Lennart Karlsson <Lennart.Karlsson at nsc.liu.se>
   National Supercomputer Centre in Linkoping, Sweden
   http://www.nsc.liu.se



