[torqueusers] Two configs for Warewulf NHC 1.2.2 ?

Grigory Shamov gas5x at yahoo.com
Tue Mar 26 08:17:13 MDT 2013


Dear Michael,

We actually have a similar problem with the check_ps_* checks being slow.  I thought of running two instances of NHC: a fast one for Torque that would report "ERROR", offline nodes, etc. without the slower tests, and another run as a cron job.  The second would not block pbs_mom, so it could run the slow checks, perhaps with a larger timeout, just to look for runaway processes and unauthorized users.

However, at first glance this does not seem possible, since CONFFILE is read only from /etc/sysconfig/nhc, which is system-wide.  Is there a way to tell NHC to use a different config for a given run (e.g. on the command line)?
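
For the cron side of this idea, here is a minimal sketch of a wrapper that gives the slow checks their own, larger time limit.  It assumes GNU coreutils "timeout" is available (it exits with 124 when the limit is hit).  The function name, the variable SLOW_CHECK_TIMEOUT and the 120-second default are hypothetical, and the command being wrapped is left as a placeholder, since how to point NHC at a second config is exactly the open question above.

```shell
#!/usr/bin/env bash
# Hypothetical cron wrapper, not part of NHC: run a slow health check
# under its own time limit so it never has to fit into pbs_mom's window.

# Time limit in seconds for the slow checks (assumed default).
SLOW_CHECK_TIMEOUT=${SLOW_CHECK_TIMEOUT:-120}

# Run the given command under "timeout" and report the outcome on stderr.
# Returns the command's exit status, or 124 if the time limit was hit.
run_slow_check() {
    local rc=0
    timeout "$SLOW_CHECK_TIMEOUT" "$@" || rc=$?
    if (( rc == 124 )); then
        echo "slow check timed out after ${SLOW_CHECK_TIMEOUT}s: $*" >&2
    elif (( rc != 0 )); then
        echo "slow check failed (exit $rc): $*" >&2
    fi
    return "$rc"
}

# Example call (placeholder command, to be replaced by the real slow check):
#   run_slow_check /path/to/slow-check-command
```

A script like this could be called from a cron entry at whatever interval suits the site; what it does on failure (mail, offline the node, just log) is a separate decision.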

--
Grigory Shamov
University of Manitoba

--- On Thu, 1/24/13, Michael Jennings <mej at lbl.gov> wrote:

> From: Michael Jennings <mej at lbl.gov>
> Subject: Re: [torqueusers] Warewulf NHC 1.2.2 Release
> To: torqueusers at supercluster.org
> Date: Thursday, January 24, 2013, 12:51 PM
> On Thursday, 24 January 2013, at 15:26:14 (-0500), Matt Britt wrote:
> 
> > Thanks Michael - that got me pointed in the right direction.  We're
> > just using /etc/passwd, and it should be up to date.  The function
> > using the time was 'check_ps_daemon sshd root':
> > 
> > [root@nyx5506 msbritt]# time nhc       (with check_ps_daemon)
> > 
> > real    0m5.785s
> > user    0m5.565s
> > sys     0m0.101s
> > [root@nyx5506 msbritt]# !vim
> > vim /etc/nhc/nhc.conf
> > [root@nyx5506 msbritt]# time nhc  (without check_ps_daemon)
> > 
> > real    0m0.185s
> > user    0m0.109s
> > sys     0m0.055s
> 
> Wow, that's quite a difference.  :-)
> 
> Is that the only check_ps_* check in your configuration?  I'm guessing
> it is based on the time delay.
> 
> What happens is this:  the first time you use one of the process-based
> checks, NHC will run the "ps" command to gather information on all
> your system processes.  This can, as you're seeing, take quite a bit
> of time on a heavily-loaded compute node.  However, it only needs to
> do this once; if you use one ps-based check, you can use as many as
> you want because you've already "taken the hit" of the subprocess
> overhead.  Subsequent checks will use the cached data instead of
> launching "ps" again.
> 
> Glad you found the culprit!  NHC tries to be as efficient as possible
> in everything it does, but it's up to each site to determine how they
> want to balance the tradeoffs between longer/shorter execution time
> for NHC and more/less comprehensive assessments of node health.  I
> tried to make it as easy as possible to measure and evaluate those
> tradeoffs; hopefully I succeeded.  :-)
> 
> Michael
> 
> -- 
> Michael Jennings <mej at lbl.gov>
> Senior HPC Systems Engineer
> High-Performance Computing Services
> Lawrence Berkeley National Laboratory
> Bldg 50B-3209E        W: 510-495-2687
> MS 050B-3209          F: 510-486-8615
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
> 
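
The "run ps once, then reuse the cached data" behaviour described in the quoted message above can be illustrated with a short bash sketch.  This is not NHC's actual code: the names PS_CMD, PS_CACHE, gather_ps_data and check_daemon_running are made up for illustration, and the ps options shown are only one plausible way to list the owner and command name of every process.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; all names here are hypothetical, not NHC internals.

PS_CMD=(ps -e -o user= -o comm=)  # command that lists "user command" per process
PS_CACHE=()                       # cached lines of process data
PS_CACHED=0                       # becomes 1 once the cache has been filled
PS_CALLS=0                        # how many times the listing command was launched

# Fill PS_CACHE on first use; every later call returns immediately.
gather_ps_data() {
    if (( PS_CACHED )); then
        return 0
    fi
    local line
    while IFS= read -r line; do
        PS_CACHE+=("$line")
    done < <("${PS_CMD[@]}")
    PS_CACHED=1
    PS_CALLS=$(( PS_CALLS + 1 ))
}

# Succeed if a process named $1 owned by user $2 appears in the cached data.
check_daemon_running() {
    local want_cmd=$1 want_user=$2 entry user cmd
    gather_ps_data
    for entry in "${PS_CACHE[@]}"; do
        read -r user cmd <<< "$entry"
        if [[ $cmd == "$want_cmd" && $user == "$want_user" ]]; then
            return 0
        fi
    done
    return 1
}

# Example: only the first check pays for the subprocess.
#   check_daemon_running sshd root
#   check_daemon_running crond root
```

The point of the sketch is the cost model: the first process-based check launches the subprocess, and any number of further checks only scan the array already in memory.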

