[torqueusers] Nodes that pbs reports are busy which are actually running a job

Rushton Martin JMRUSHTON at qinetiq.com
Thu Aug 12 03:56:18 MDT 2010


Be careful about assuming that one user = one job.  When our new cluster
was delivered, someone had configured the epilogue to kill off all
processes belonging to the user.  With 8 or 16 cores per node, we were
caught out when one user had several jobs running on the same node: the
first job to finish killed off the user's other jobs.
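
A minimal sketch of the check that would have saved us, in Python. The
function names and the jobs_on_node mapping (job id -> owning user) are
hypothetical; how you build that mapping depends on your site, and the
job id and user name are assumed to come from the arguments TORQUE passes
to the epilogue (check the documentation for your version).

```python
def user_has_other_jobs(finished_job_id, user, jobs_on_node):
    """Return True if `user` owns a job on this node other than the one ending.

    jobs_on_node is a hypothetical mapping of job id -> owning user name
    for every job currently assigned to this node.
    """
    return any(
        owner == user and job_id != finished_job_id
        for job_id, owner in jobs_on_node.items()
    )


def safe_to_kill_all_user_processes(finished_job_id, user, jobs_on_node):
    """A blanket kill by user name is only safe if no other job of theirs remains."""
    return not user_has_other_jobs(finished_job_id, user, jobs_on_node)


# Illustrative data only: alice has two jobs on the node, bob has one.
jobs = {"101.master": "alice", "102.master": "alice", "103.master": "bob"}
print(safe_to_kill_all_user_processes("101.master", "alice", jobs))  # False
print(safe_to_kill_all_user_processes("103.master", "bob", jobs))    # True
```

An epilogue built on this would skip the blanket kill whenever the check
returns False and leave cleanup to the user's last job on the node.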

Martin Rushton
Weapons Technologies
Tel: 01959 514777, Mobile: 07939 219057
email: jmrushton at QinetiQ.com
www.QinetiQ.com

-----Original Message-----
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Garrick
Staples
Sent: 11 August 2010 23:13
To: Torque Users Mailing List
Subject: Re: [torqueusers] Nodes that pbs reports are busy which are
actually running a job

On Wed, Aug 11, 2010 at 04:59:07PM -0500, Rahul Nabar alleged:
> On Wed, Aug 11, 2010 at 4:53 PM, Garrick Staples <garrick at usc.edu> wrote:
> >
> > Nope, it doesn't have a job. What you have are stale processes from an old job.
> 
> Thanks! I killed them. Does PBS clean up processes automatically after
> a job ends? Or is there a suitable flag? These are non-shared nodes, so
> there is no risk of stepping on another job's processes. All 8 cores are
> always assigned to the same user.
> 
> If not, is it an OK fix to put a pkill in the epilogue for all normal
> usernames? Any caveats? Or better ideas?

PBS will kill the processes that it knows about. This includes any children
of the batch script and any processes launched through the TM interface.
Any remote processes started through a remote shell are unknown to PBS,
so it can't kill them. It is up to your epilogue to figure out what else
needs to be killed.
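
As a rough illustration of "figuring out what else needs to be killed",
here is a Python sketch that picks out a user's leftover processes from a
process listing. It assumes text in the shape produced by
`ps -eo pid=,sid=,user=` (pid, session id, user name per line). The
function names and the live_sessions set (session ids of jobs still
running on the node) are hypothetical, not anything PBS provides.

```python
def parse_ps(ps_text):
    """Parse lines of 'pid sid user' into a list of (pid, sid, user) tuples."""
    procs = []
    for line in ps_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        pid, sid, user = fields
        procs.append((int(pid), int(sid), user))
    return procs


def stray_pids(procs, user, live_sessions):
    """PIDs owned by `user` whose session is not a still-running job session."""
    return sorted(
        pid
        for pid, sid, owner in procs
        if owner == user and sid not in live_sessions
    )


# Illustrative listing: session 4100 belongs to a job that is still running,
# session 5200 is left over from something started through a remote shell.
sample = """\
 4101  4100 alice
 4102  4100 alice
 5200  5200 alice
  900   900 root
"""
print(stray_pids(parse_ps(sample), "alice", live_sessions={4100}))  # [5200]
```

The selected PIDs would then be candidates for a kill in the epilogue. On
nodes that can hold more than one job per user, a remote-shell process may
belong to a job that is still running elsewhere, so session ids alone
cannot settle that case.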

--
Garrick Staples, GNU/Linux HPCC SysAdmin University of Southern
California

Life is Good!