[torqueusers] dead mom daemon

Chris Samuel csamuel at vpac.org
Thu Jun 12 21:15:12 MDT 2008


----- "Nicolas Ferré" <nicolas.ferre at univ-provence.fr> wrote:

> Hi,
> 
> Sometimes the mom daemon (2.3.1-snap.200804241117) dies without any
> notice (from what I can see in the log file). In the server log, I can
> see:
>
> What should I do to diagnose the problem?

What do the mom_logs say on the compute node, and is there anything
in the syslogs on that node as well?
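
As a rough sketch of where to look: the snippet below assumes TORQUE's
home directory is /var/spool/torque and that the mom log files are named
by date (YYYYMMDD). Both are common defaults, not something confirmed for
this cluster, and the helper name mom_log_tail is made up for illustration.

```shell
# Assumption: TORQUE's home directory is /var/spool/torque; adjust to
# wherever your build keeps its spool directory.
PBS_HOME="${PBS_HOME:-/var/spool/torque}"

# Hypothetical helper: print the last lines of today's mom log.
# $1 = log directory, $2 = number of lines (defaults to 50).
mom_log_tail() {
    log_dir="$1"
    lines="${2:-50}"
    tail -n "$lines" "$log_dir/$(date +%Y%m%d)"
}

# On the compute node, something like:
#   mom_log_tail "$PBS_HOME/mom_logs"
#   grep pbs_mom /var/log/messages   # syslog location varies by distribution
```

If the daemon died on an earlier day, look at the file for that date
instead of today's.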

Also, are you using cpuset support? If so, there is a known file
descriptor leak that breaks the mom after a certain number of jobs
have run through it. It's fixed in SVN and in the current (June)
snapshot:

torque-2.3.1-snap.200806121700.tar.gz
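
One way to check whether a running mom is affected by a descriptor leak
is to watch how many file descriptors it holds over time. This is only a
sketch: it assumes a Linux node with /proc mounted, and the helper name
count_fds is made up for illustration.

```shell
# Hypothetical helper: count the open file descriptors of a process
# by listing its /proc/<pid>/fd directory (Linux only).
count_fds() {
    ls "/proc/$1/fd" | wc -l
}

# On the compute node, as root (pidof is assumed to be installed):
#   count_fds "$(pidof pbs_mom)"
```

A count that keeps climbing as jobs pass through the node, rather than
settling back down, would be consistent with a leak.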
 
> Another thing: when the mom daemon is dead, jobs already running
> continue to run. However, if I restart the mom daemon, they are killed
> immediately and placed in the queue as if they had never run before.
> How can I ensure that running jobs continue to run when the mom
> daemon is restarted?

You want to start pbs_mom with the -p option to preserve
existing jobs.  We always do this.
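
A minimal sketch of such a start, wrapped in a small function. The
function name and the PBS_MOM_BIN override are made up for illustration
and are not TORQUE settings; only the -p option comes from the advice
above. It assumes pbs_mom is on root's PATH.

```shell
# Hypothetical helper: start pbs_mom with -p so that jobs already
# running on the node are preserved rather than killed and requeued.
# PBS_MOM_BIN is an illustrative override for a non-standard install path.
start_mom_preserving() {
    "${PBS_MOM_BIN:-pbs_mom}" -p
}

# As root on the compute node, once the old daemon is gone:
#   start_mom_preserving
```

If the mom is started from an init script, the option would need to be
added there instead, so that it also applies after a reboot or a
scripted restart.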

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency

