[torqueusers] Nodes that pbs reports are busy which are actually running a job
Coyle, James J [ITACD]
jjc at iastate.edu
Thu Aug 12 13:32:21 MDT 2010
I'd encourage you to check that the node is dedicated to a single batch job
before the kills. Even though the current policy makes this unnecessary,
at some point you may change policy or re-use the code, and you'll
never remember the condition that made it safe to assume the node was dedicated,
or why that assumption was necessary.
I implemented a node_cleanup script that the epilogue calls.
The check to see if the node is dedicated is simply a count of the number of
times the node is contained in $PBS_NODEFILE. If that count equals np
for that node, the node is dedicated to the batch job, and in that case it is
OK to kill runaway processes. I also call node_cleanup from the prologue, in case
errant processes were left over from a previous non-dedicated job.
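A minimal sketch of that dedication check, assuming the script runs on the node itself and that TORQUE has exported PBS_NODEFILE (the count_slots helper name and the demo nodefile are mine, for illustration; np would come from the node's entry in the server's nodes file):

```shell
#!/bin/sh
# Sketch: decide whether this node is dedicated to the current job by
# counting how many of its np slots appear in $PBS_NODEFILE.

# Count exact-match occurrences of a hostname in a nodefile.
count_slots() {
    # $1 = nodefile path, $2 = hostname
    grep -c -x "$2" "$1"
}

# Demo with a fabricated nodefile: node01 holds both of its np=2 slots.
nodefile=$(mktemp)
printf 'node01\nnode01\n' > "$nodefile"

slots=$(count_slots "$nodefile" node01)
np=2    # in a real epilogue, taken from the node's np setting

if [ "$slots" -eq "$np" ]; then
    echo "node dedicated: safe to kill runaway processes"
else
    echo "node shared: skip cleanup"
fi

rm -f "$nodefile"
```

On a real node you would pass `$(hostname)` instead of the hard-coded node01, and only proceed to the kill step in the dedicated branch.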
Research Computing Group
115 Durham Center http://jjc.public.iastate.edu
Iowa State Univ.
Ames Iowa 50011
From: torqueusers-bounces at supercluster.org [torqueusers-bounces at supercluster.org] On Behalf Of Rahul Nabar [rpnabar at gmail.com]
Sent: Thursday, August 12, 2010 2:16 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] Nodes that pbs reports are busy which are actually running a job
On Thu, Aug 12, 2010 at 10:43 AM, Gus Correa <gus at ldeo.columbia.edu> wrote:
> If the user is running a new job on the same node,
How so? Won't the epilogue run before the new job gets assigned? Thus
the pkill should be safe, right?
> or you if share nodes across different jobs and users,
> this will kill legitimate processes.
Not a problem. Our nodes are exclusive. A user gets a full node at a time.
torqueusers mailing list
torqueusers at supercluster.org