[torqueusers] pbs_mom request, was Re: PBS_MOM kills running jobs when restarted

Chris Samuel csamuel at vpac.org
Sat Dec 12 04:37:29 MST 2009


----- "Douglas Needham" <dneedham at cmu.edu> wrote:

> I would like to hear the details on this.  Would
> you be willing to highlight some of the issues at
> least?  

Firstly, as Glen mentioned, a node that goes bad and
reboots under load will drain your queue through the
reboot -> accept job -> reboot loop, losing each job
it accepts along the way. :-(

Secondly, we've seen MPI jobs fail because the default
resource limit on the amount of memory that can be
locked caused job initialisation to fail.  For some
reason, even inserting a "ulimit -l unlimited" into
the init.d script before it starts pbs_mom didn't
seem to fix it.
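
One way to see which locked-memory limit a running pbs_mom
actually ended up with (jobs inherit their limits from it)
is to read /proc/<pid>/limits on the node, on kernels that
provide that file.  A minimal Python sketch, assuming a
Linux node; the function names are made up for illustration
and are not part of Torque:

```python
from pathlib import Path


def parse_locked_memory(limits_text):
    """Return the (soft, hard) 'Max locked memory' values found in the
    text of a /proc/<pid>/limits file, or None if the row is missing."""
    label = "Max locked memory"
    for line in limits_text.splitlines():
        if line.startswith(label):
            # The rest of the row is: soft limit, hard limit, units.
            fields = line[len(label):].split()
            return fields[0], fields[1]
    return None


def locked_memory_of(pid):
    """Read the limits of a live process (Linux only)."""
    return parse_locked_memory(Path(f"/proc/{pid}/limits").read_text())


# Usage on a compute node, with the PID of the running pbs_mom:
#   print(locked_memory_of(12345))   # 12345 is a placeholder PID
```

If both values are not "unlimited", the ulimit in the init
script did not reach the daemon that is actually running.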

Thirdly, if a node does go bad and reboot then it
makes diagnosis and troubleshooting a lot easier if
the node has no jobs on it.

cheers!
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency

