[torqueusers] pbs_mom request, was Re: PBS_MOM kills running jobs when restarted
Chris Samuel
csamuel at vpac.org
Sat Dec 12 04:37:29 MST 2009
----- "Douglas Needham" <dneedham at cmu.edu> wrote:
> I would like to hear the details on this. Would
> you be willing to highlight some of the issues at
> least?
Firstly, as Glen mentioned, a node that goes bad and
reboots under load will drain your queue through the
reboot->accept job->reboot loop. :-(
Secondly, we've seen MPI jobs fail during initialisation
because of the default resource limit on the amount of
memory that can be locked. For some reason, even
inserting a "ulimit -l unlimited" into the init.d script
before it starts the pbs_mom didn't seem to fix it.
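One way to see which locked-memory limit a job actually
inherits is to print it from inside the job itself. The
following is only a minimal sketch using Python's standard
"resource" module; the helper names format_limit and
memlock_limits are made up for illustration, and it assumes
a Python interpreter is available on the compute node.

```python
import resource


def format_limit(value):
    """Render an rlimit value like `ulimit -l`: 'unlimited' or a KiB count."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    # getrlimit() reports bytes; ulimit -l reports 1024-byte units.
    return str(value // 1024)


def memlock_limits():
    """Return the (soft, hard) locked-memory limits of this process as strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return format_limit(soft), format_limit(hard)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print("max locked memory (KiB): soft=%s hard=%s" % (soft, hard))
```

Running this from a job script and again from a login shell
on the same node would show whether the limit set in the
init.d script reaches the processes that pbs_mom spawns.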
Thirdly, if a node does go bad and reboot then it
makes diagnosis and troubleshooting a lot easier if
the node has no jobs on it.
cheers!
Chris
--
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency