[torqueusers] dead mom daemon
Chris Samuel
csamuel at vpac.org
Thu Jun 12 21:15:12 MDT 2008
----- "Nicolas Ferré" <nicolas.ferre at univ-provence.fr> wrote:
> Hi,
>
> Sometimes the mom daemon (2.3.1-snap.200804241117) dies without any
> notice (from what I can see in the log file). In the server log, I can
> see:
>
> What should I do to diagnose the problem?
What do the mom_logs say on the compute node, and is there anything
in the syslogs on that node too?
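As a rough sketch of how to pull up the most recent mom log entries on the node, something like the following Python could be used. The directory /var/spool/torque/mom_logs is only an assumed default and may differ on your install, and the helper names newest_log and tail are made up for this illustration.

```python
from pathlib import Path

# Assumed default location; adjust to your TORQUE spool directory.
MOM_LOG_DIR = Path("/var/spool/torque/mom_logs")


def newest_log(log_dir):
    """Return the most recently modified regular file in log_dir, or None."""
    files = [p for p in Path(log_dir).iterdir() if p.is_file()]
    return max(files, key=lambda p: p.stat().st_mtime, default=None)


def tail(path, n=50):
    """Return the last n lines of a text file as a list of strings."""
    with open(path, errors="replace") as fh:
        return fh.readlines()[-n:]


if __name__ == "__main__":
    log = newest_log(MOM_LOG_DIR) if MOM_LOG_DIR.is_dir() else None
    if log is None:
        print(f"no mom logs found under {MOM_LOG_DIR}")
    else:
        print(f"--- last lines of {log} ---")
        print("".join(tail(log)), end="")
```

The last lines written before the daemon went away are usually the interesting ones, so it is worth comparing their timestamps against the syslog on the same node.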
Also, are you using cpuset support? If so, there is a
known file descriptor leak which breaks the mom after
a certain number of jobs have run through it. It's
fixed in SVN and is in the current (June) snapshot:
torque-2.3.1-snap.200806121700.tar.gz
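If you want to check whether a descriptor leak is what you are hitting, one option is to watch how many file descriptors the pbs_mom process holds as jobs pass through it. The sketch below is Linux only and assumes /proc is mounted and that you have permission to read /proc/<pid>/fd (normally root for pbs_mom). The function names and the "never decreases" heuristic are illustrative assumptions, not part of TORQUE.

```python
import os


def count_open_fds(pid):
    """Count the entries in /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def looks_like_leak(samples):
    """Heuristic: counts never go down and end higher than they started."""
    if len(samples) < 2:
        return False
    never_drops = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_drops and samples[-1] > samples[0]
```

Sampling count_open_fds with the pbs_mom PID after each batch of jobs and passing the collected counts to looks_like_leak gives a crude indication. A count that creeps toward the per-process limit shown by `ulimit -n` would be consistent with the leak described above.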
> Another thing: when the mom daemon is dead, jobs already running
> continue to run. However, if I restart the mom daemon, they are killed
> immediately and placed back in the queue as if they had never run
> before. How can I ensure that running jobs continue to run when the
> mom daemon is restarted?
You want to start pbs_mom with the -p option to preserve
existing jobs. We always do this.
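If the daemon is started from a script rather than by hand, a minimal sketch of passing -p could look like this. It assumes pbs_mom is on the PATH and that the script runs with sufficient privileges (normally root). The wrapper names mom_command and start_mom are hypothetical.

```python
import subprocess


def mom_command(preserve_jobs=True, binary="pbs_mom"):
    """Build the argument list for starting the mom daemon."""
    cmd = [binary]
    if preserve_jobs:
        cmd.append("-p")  # preserve jobs that are already running
    return cmd


def start_mom(preserve_jobs=True):
    """Start pbs_mom; raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(mom_command(preserve_jobs), check=True)
```

Calling start_mom() runs `pbs_mom -p`. If your init script starts the daemon without any options, that is the place to add the flag.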
cheers,
Chris
--
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency