[torquedev] pbsnodes -a shows long gone jobs
Lennart Karlsson
Lennart.Karlsson at nsc.liu.se
Thu Jul 2 01:01:09 MDT 2009
On Wed, 17 Jun 2009, Bogdan Costescu wrote:
> On Wed, 17 Jun 2009, Glen Beane wrote:
>
> > You will notice the state is free, but this lists 5 jobs in the status
> > string. These jobs are long gone from the system, in the case of job
> > number 37005 the job has been completed for OVER 2 MONTHS!
>
> I have often seen such jobs kept in the list without actually
> running, in cases where the job died for some reason, e.g. when
> nodes crash; on a ~7-year-old cluster, reliability is quite poor, so
> such events happen often. I haven't seen any bad effects from this,
> except maybe some messages from nodes to the master asking for the
> job to be killed (which the server ignores, as the jobs are no
> longer in its database). This is with Torque 2.1.10.
>
> The nodes are rebuilt upon reboot so whatever state Torque keeps on
> the local disk is lost, therefore I can't say whether a reboot does
> something to these phantom jobs...
One bad effect of Torque still remembering these long-gone jobs is
that the
$enablemomrestart 1
setting in the pbs_mom config file stops working, i.e. a
touch whatever_path/sbin/pbs_mom
command no longer restarts pbs_mom, presumably because pbs_mom only
restarts itself when it believes no jobs are running on the node.
Why are the jobs still remembered?
What is the recommended way to get Torque to forget about them?
(I run version 2.3.0-snap.200803221012.)
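
To see which job IDs a node still reports, something like the following might help. It is only a rough sketch: it assumes the status string contains a field of the form jobs=<id> <id> ... ending at the next comma, which should be checked against real pbsnodes -a output first. The helper names status_jobs and stale_jobs are made up for this example and are not part of Torque.

```shell
#!/bin/sh
# Assumption: the status string carries a field like
#   jobs=37005.srv 37010.srv,
# (IDs separated by spaces, field terminated by a comma).
# Verify this against your own `pbsnodes -a` output.

# Read status text on stdin, print one job ID per line (sorted, unique).
status_jobs() {
    tr ',' '\n' | sed -n 's/^ *jobs=//p' | tr ' ' '\n' | sed '/^$/d' | sort -u
}

# Print the IDs in $1 that are absent from $2.
# Both arguments are newline-separated lists of job IDs.
stale_jobs() {
    printf '%s\n' "$1" | grep -v -x -F -e "$2" | sed '/^$/d'
}
```

With these defined, pbsnodes -a | status_jobs would list what the nodes still report, and stale_jobs could compare that list with the job IDs the server still knows about (for example, IDs taken from qstat output).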
-- Lennart Karlsson <Lennart.Karlsson at nsc.liu.se>
National Supercomputer Centre in Linkoping, Sweden
http://www.nsc.liu.se