[torqueusers] Nodes that pbs reports are busy which are actually running a job

Coyle, James J [ITACD] jjc at iastate.edu
Thu Aug 12 13:32:21 MDT 2010


  I'd encourage you to check whether the node is dedicated to a single batch job
before doing the kills.  Even though the current policy makes this unnecessary,
at some point you may change policy or re-use the code, and you'll
never remember the condition that made it safe to assume the node was dedicated
or why that assumption was necessary.

I implemented a node_cleanup that the epilogue script calls.

  The check to see if the node is dedicated is simply a count of the number of
times the node appears in $PBS_NODEFILE.  If that count equals np
for that node, the node is dedicated to the batch job. In that case it is
OK to kill runaway processes.  I also call node_cleanup from the prologue, in case
errant processes were left over from a previous non-dedicated job.
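
A minimal sketch of that dedication check, in Python. This is not the actual
node_cleanup script. The function names count_slots and is_dedicated are
hypothetical, the np value is assumed to be supplied by the caller, and
comparing short hostnames (the part before the first dot) is an assumption
about how the node file is written. It also assumes $PBS_NODEFILE is readable
from wherever the script runs.

```python
import os
import socket
import sys


def count_slots(nodefile_lines, hostname):
    """Count how many node file lines name this host.

    Assumption: hosts are compared by short name (text before the first dot).
    """
    short = hostname.split(".")[0]
    return sum(
        1 for line in nodefile_lines
        if line.strip().split(".")[0] == short
    )


def is_dedicated(nodefile_lines, hostname, np):
    """True when the job holds all np slots configured for this host."""
    return np > 0 and count_slots(nodefile_lines, hostname) == np


if __name__ == "__main__":
    # Hypothetical usage: the np value for this node is the first argument.
    np_for_node = int(sys.argv[1])
    with open(os.environ["PBS_NODEFILE"]) as fh:
        lines = fh.readlines()
    if is_dedicated(lines, socket.gethostname(), np_for_node):
        print("node is dedicated: safe to clean up runaway processes")
    else:
        print("node is shared: skipping process cleanup")
```

The sketch only reports the decision. The kill step itself is left out so the
check can be exercised safely before it is wired into an epilogue or prologue.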

Jim Coyle
Research Computing Group
115 Durham Center     http://jjc.public.iastate.edu
Iowa State Univ.
Ames Iowa 50011
From: torqueusers-bounces at supercluster.org [torqueusers-bounces at supercluster.org] On Behalf Of Rahul Nabar [rpnabar at gmail.com]
Sent: Thursday, August 12, 2010 2:16 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] Nodes that pbs reports are busy which are actually running a job

On Thu, Aug 12, 2010 at 10:43 AM, Gus Correa <gus at ldeo.columbia.edu> wrote:
> If the user is running a new job on the same node,

How so? Won't the epilogue run before the new job gets assigned? Thus
the pkill should be safe, right?

> or if you share nodes across different jobs and users,
> this will kill legitimate processes.

Not a problem. Our nodes are exclusive. A user gets only a full node at a time.