[torquedev] TORQUE 2.2.0 Defaults

Dave Jackson jacksond at clusterresources.com
Fri Aug 17 12:21:28 MDT 2007


Garrick,

  Can we ignore the jobs and simply have pbs_mom always touch the pid
file when it starts and then, on each subsequent start, check whether the
pid file was touched more recently than the OS boot time?  If it was not,
then pbs_mom is starting for the first time after a reboot.  This should
be simple, safe, and applicable to all archs.
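
A minimal sketch of that check, in Python rather than pbs_mom's C, purely
to illustrate the logic. The function names are hypothetical, and reading
the boot time from the btime line of /proc/stat is a Linux-only
assumption; other archs would need their own way to get the boot time.

```python
import os


def boot_time_linux(proc_stat="/proc/stat"):
    """Return the boot time in epoch seconds (assumption: Linux /proc/stat)."""
    with open(proc_stat) as f:
        for line in f:
            if line.startswith("btime "):
                return float(line.split()[1])
    raise RuntimeError("no btime line found in " + proc_stat)


def first_start_since_boot(pid_file, boot_time):
    """True if pid_file is missing or was last touched before boot_time."""
    try:
        last_touch = os.stat(pid_file).st_mtime
    except FileNotFoundError:
        # No pid file at all: treat this as a first start.
        return True
    return last_touch < boot_time


def touch(pid_file):
    """Create pid_file if needed and set its mtime to the current time."""
    with open(pid_file, "a"):
        pass
    os.utime(pid_file, None)
```

At startup the daemon would call first_start_since_boot() before touch(),
so the comparison sees the timestamp left by the previous run.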

  Thoughts?

Dave

On Thu, 2007-08-16 at 16:58 -0700, Garrick Staples wrote:
> On Fri, Aug 17, 2007 at 12:52:19AM +0100, Craig Macdonald alleged:
> > Garrick et al,
> > 
> > >How does pbs_mom know the process is gone?  It can't check the pids because
> > >they might be reused by new processes after the boot.
> > 
> > "Current system time" - "job walltime" vs "start-time of PID"
> > 
> > 
> > All(?) Unix machines seem to provide the start time of a process, so
> > if the process for a job started unexpectedly recently (later than the
> > job's walltime implies), then the node must have rebooted and the
> > original job is dead.
> > 
> > My only questions are
> > (a) is this figure reset when exec() is called
> > and
> > (b) what happens if the system clock changes unexpectedly over the
> > lifetime of the job and pbs_mom is then restarted with recovery enabled.
> > Not sure if a Daylight saving time change would be a problem in this
> > scenario.
> 
> That's possible, but would have to be implemented individually for each mom
> arch (look at the different directories in src/resmom).
> _______________________________________________
> torquedev mailing list
> torquedev at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torquedev
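
For comparison, the check Craig describes in the quoted text ("current
system time" minus "job walltime" versus "start-time of PID") could be
sketched as below. This is only an illustration: the function name and
the slack parameter are hypothetical, and how the start time of a PID is
obtained is left to the caller, since that part differs per arch.

```python
def pid_belongs_to_job(now, job_walltime, pid_start_time, slack=60.0):
    """Guess whether a PID is still the job's original process.

    now, pid_start_time: epoch seconds; job_walltime: seconds the job
    has been running. slack (hypothetical) absorbs small clock and
    startup differences.
    """
    # When the job's process should have started if it is the original.
    expected_start = now - job_walltime
    # A process that started much later than that is probably a new
    # process that reused the PID after a reboot.
    return pid_start_time <= expected_start + slack
```

A clock change during the job's lifetime, as raised in point (b), would
shift expected_start and could make this comparison give the wrong
answer.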


