[torquedev] pbsnodes -a shows long gone jobs

Bogdan Costescu Bogdan.Costescu at iwr.uni-heidelberg.de
Wed Jun 17 07:49:53 MDT 2009


On Wed, 17 Jun 2009, Glen Beane wrote:

> You will notice the state is free, but this lists 5 jobs in the status
> string.  These jobs are long gone from the system, in the case of job
> number 37005 the job has been completed for OVER 2 MONTHS!

I have often seen such jobs kept in the list without actually 
running, in cases where the job died for some reason, e.g. when 
nodes crash; on a ~7-year-old cluster, reliability is quite poor, so 
such events happen often. I haven't seen any bad effects from this, 
except perhaps some messages from the nodes to the master asking for 
the job to be killed (which the server ignores, as the jobs are no 
longer in its database). This is with Torque 2.1.10.
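To spot such phantom jobs, one could compare the job IDs a node reports in its status string against the jobs the server still knows about. The sketch below is hypothetical and assumes that the pbsnodes "status" value is a comma-separated list of key=value pairs, and that the jobs= value holds whitespace-separated job IDs. The exact format may differ between Torque versions. The function names and sample job IDs are made up for illustration.

```python
from typing import Iterable, List


def jobs_in_status(status: str) -> List[str]:
    """Return the job IDs listed in the jobs= field of a node status string.

    ASSUMPTION: fields are comma-separated key=value pairs and the jobs=
    value contains whitespace-separated job IDs.
    """
    for field in status.split(","):
        key, sep, value = field.strip().partition("=")
        if sep and key == "jobs":
            return value.split()
    return []  # no jobs= field present


def phantom_jobs(status: str, server_jobs: Iterable[str]) -> List[str]:
    """Jobs the node still reports but the server no longer has (sorted)."""
    return sorted(set(jobs_in_status(status)) - set(server_jobs))


if __name__ == "__main__":
    # Hypothetical status string and server job list.
    status = "opsys=linux,state=free,jobs=37005.master 37010.master,rectime=1"
    print(phantom_jobs(status, {"37010.master"}))
```

Here the node reports two jobs but the (hypothetical) server only knows one, so the other is flagged as a leftover.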

The nodes are rebuilt upon reboot, so whatever state Torque keeps on 
the local disk is lost; therefore I can't say whether a reboot does 
anything to these phantom jobs...

-- 
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.costescu at iwr.uni-heidelberg.de

