[torqueusers] Nodes that pbs reports are busy which are actually running a job
JMRUSHTON at qinetiq.com
Thu Aug 12 03:56:18 MDT 2010
Be careful about assuming that one user = one job. When our new cluster
was delivered, someone had configured the epilogue to kill off all
processes belonging to the user, but with 8 or 16 cores per node we were
caught out when one user had several jobs running on a node. The first
job to finish killed off the user's other jobs.
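
A minimal sketch of the check that would have saved us, assuming the
epilogue can obtain a list of (job id, owner) pairs for the jobs still
assigned to the node. How that list is gathered is site-specific and not
shown; the function names and the job ids below are made up for
illustration.

```python
from typing import Iterable, List, Tuple

Job = Tuple[str, str]  # (job_id, owner); hypothetical representation


def other_jobs_for_user(finished_job_id: str, user: str,
                        jobs_on_node: Iterable[Job]) -> List[str]:
    """Return ids of jobs on this node owned by `user`, excluding the one that just ended."""
    return [job_id for job_id, owner in jobs_on_node
            if owner == user and job_id != finished_job_id]


def safe_to_kill_all(finished_job_id: str, user: str,
                     jobs_on_node: Iterable[Job]) -> bool:
    """True only if the user has no other job left on this node."""
    return not other_jobs_for_user(finished_job_id, user, jobs_on_node)
```

With jobs 101 and 102 owned by the same user on one node, the epilogue
for job 101 would get False here and should leave the user's processes
alone; a blanket pkill is only reasonable when the answer is True.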
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Garrick
Sent: 11 August 2010 23:13
To: Torque Users Mailing List
Subject: Re: [torqueusers] Nodes that pbs reports are busy which are
actually running a job
On Wed, Aug 11, 2010 at 04:59:07PM -0500, Rahul Nabar alleged:
> On Wed, Aug 11, 2010 at 4:53 PM, Garrick Staples <garrick at usc.edu> wrote:
> > Nope, it doesn't have a job. What you have are stale processes from
> > an old job.
> Thanks! I killed them. Does PBS clean up processes automatically after
> a job ends? Or is there a suitable flag? These are non-shared nodes,
> so there is no risk of stepping on another job's processes. All 8 cores
> are always assigned to the same user.
> If not, is it an OK fix to put a pkill in the epilogue for all normal
> usernames? Any caveats? Or better ideas?
PBS will kill processes that it knows about. This includes any children
of the batch script and any processes launched through the TM interface.
Any remote processes started through a remote shell are unknown to PBS
and can't be killed. It is up to your epilogue to figure out what else
needs to be killed.
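
One way an epilogue might work out "what else needs to be killed" is to
compare the user's processes on the node against the sessions that
belong to jobs still running there. The sketch below is only an
illustration under that assumption: the Proc record, the stray_pids
name, and the idea that the epilogue already has the set of live session
ids are all hypothetical, and collecting the process table is left out.

```python
from typing import Iterable, List, NamedTuple, Set


class Proc(NamedTuple):
    pid: int
    user: str
    sid: int  # session id of the process


def stray_pids(procs: Iterable[Proc], user: str,
               live_sessions: Set[int]) -> List[int]:
    """Return pids owned by `user` whose session is not one of the live job sessions."""
    return [p.pid for p in procs
            if p.user == user and p.sid not in live_sessions]
```

Processes started through a remote shell would normally sit in a session
of their own, so they show up as strays. The caveat is that this is also
true of remote-shell processes belonging to a job that is still running,
so on nodes where one user can have several jobs the result needs
further filtering before anything is killed.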
Garrick Staples, GNU/Linux HPCC SysAdmin University of Southern
Life is Good!