[torqueusers] CONTRIBUTION: pestat utility version 2.0

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Sat Sep 29 12:12:22 MDT 2007


Hi Tom,

I'm glad you like the pestat utility.  It is actually a
great tool for diagnosing the kind of problem that you describe.

I have seen such NONE* zombie jobs many times.  They may appear
briefly while a job is finishing up, but they should never persist
for more than a few seconds.  Maybe you have used "qdel -p"
to kill a hung job?  This may very well leave zombie jobs
in the pbs_mom processes on the sister nodes, while pbs_server
thinks that these jobs are long gone.

Anyway, the solution is to make sure the node doesn't run any jobs
(you may want to offline the node and wait until its jobs finish).
Log in to the node and stop pbs_mom (service pbs_mom stop),
go to /var/spool/torque/mom_jobs/ and you will quite likely
find some job status files and directories belonging to those
zombie job ids.  Do an "rm -rf" of those zombie job files and
then start pbs_mom again; now its memory of the jobs has
been wiped clean.  In all the cases that I have seen,
this clears away the zombie jobs reliably (we currently run Torque
version 2.1.8).
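
The cleanup step above could be sketched roughly as the shell function
below.  This is only an illustration, not something shipped with pestat
or Torque: the function name clean_zombies and its --delete switch are
made up here, and it assumes that the entries in the mom_jobs directory
start with the numeric job id followed by a dot.  By default it only
prints what it would remove, so the list can be checked first.

```shell
#!/bin/sh
# Hypothetical helper (not part of Torque or pestat).
# Usage: clean_zombies DIR [--delete] JOBID...
# Assumption: leftover files/directories are named "<jobid>.<something>".
clean_zombies() {
    dir=$1
    shift
    delete=no
    if [ "$1" = "--delete" ]; then
        delete=yes
        shift
    fi
    for id in "$@"; do
        # The trailing dot keeps id 42 from matching job 423.
        for f in "$dir"/"$id".*; do
            # Skip the unexpanded pattern when nothing matches.
            [ -e "$f" ] || continue
            if [ "$delete" = yes ]; then
                rm -rf -- "$f"
                echo "removed $f"
            else
                echo "would remove $f"
            fi
        done
    done
}

# Intended use on an idle node, with pbs_mom stopped first:
#   service pbs_mom stop
#   clean_zombies /var/spool/torque/mom_jobs 423 578
#   clean_zombies /var/spool/torque/mom_jobs --delete 423 578
#   service pbs_mom start
```

Running it once without --delete and reading the output before deleting
anything is the point of the dry-run default, since "rm -rf" on the wrong
job id would remove the files of a live job.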

Best regards,
Ole

> I copied and installed pestat - Works great on RHEL4. 
> 
> Two questions tho.
> 
>  ./pestat -f
> Listing only nodes that are flagged by *
>   node   state  load    pmem ncpu   mem   resi usrs tasks  jobids/users
>   node07  free  0.61*   3946   4   5866   1229  3/2    0
>   node08  free  0.00    3946   4   5866    203  2/2    0*   423 NONE* 578 NONE*
>   node12  free  0.99    3946   4   5866    324  3/3    1*   418 NONE* 422 NONE* 1034 rsnxgp
>   node13  free  0.75    3946   4   5866    163  2/2    1*   417 NONE* 924 NONE* 1023 rsnxgp
>   node55  free  1.20*   3946   4   5866    308  2/2    0*   420 NONE* 421 NONE*
>   node56  free  1.02*   3946   4   5866    225  3/2    0*   416 NONE* 419 NONE* 425 NONE* 577 NONE* 579 NONE*
>   silvio  free  0.66*   8112   4   8017   3686  37/9*   0
> 
> On node08 - is it saying that the mom logs think that there is a job 423 
> and 578 but there really is not (load=0.0)?
> 
> How can I kill or clean those records?

