[torqueusers] Warewulf NHC 1.2.2 Release

Matt Britt msbritt at umich.edu
Thu Jan 24 13:56:18 MST 2013


It is the only check_ps we're using, but after your explanation, I'm going
to stick more in :)

Thanks again,
 - Matt



On Thu, Jan 24, 2013 at 3:51 PM, Michael Jennings <mej at lbl.gov> wrote:

> On Thursday, 24 January 2013, at 15:26:14 (-0500),
> Matt Britt wrote:
>
> > Thanks Michael - that got me pointed in the right direction.  We're
> > just using /etc/passwd, and it should be up to date.  The check
> > taking up the time was 'check_ps_daemon sshd root':
> >
> > [root@nyx5506 msbritt]# time nhc        (with check_ps_daemon)
> >
> > real    0m5.785s
> > user    0m5.565s
> > sys     0m0.101s
> > [root@nyx5506 msbritt]# !vim
> > vim /etc/nhc/nhc.conf
> > [root@nyx5506 msbritt]# time nhc  (without check_ps_daemon)
> >
> > real    0m0.185s
> > user    0m0.109s
> > sys     0m0.055s
>
> Wow, that's quite a difference.  :-)
>
> Is that the only check_ps_* check in your configuration?  I'm guessing
> it is based on the time delay.
>
> What happens is this:  the first time you use one of the process-based
> checks, NHC will run the "ps" command to gather information on all
> your system processes.  This can, as you're seeing, take quite a bit
> of time on a heavily-loaded compute node.  However, it only needs to
> do this once; if you use one ps-based check, you can use as many as
> you want because you've already "taken the hit" of the subprocess
> overhead.  Subsequent checks will use the cached data instead of
> launching "ps" again.
>
> Glad you found the culprit!  NHC tries to be as efficient as possible
> in everything it does, but it's up to each site to determine how they
> want to balance the tradeoffs between longer/shorter execution time
> for NHC and more/less comprehensive assessments of node health.  I
> tried to make it as easy as possible to measure and evaluate those
> tradeoffs; hopefully I succeeded.  :-)
>
> Michael
>
> --
> Michael Jennings <mej at lbl.gov>
> Senior HPC Systems Engineer
> High-Performance Computing Services
> Lawrence Berkeley National Laboratory
> Bldg 50B-3209E        W: 510-495-2687
> MS 050B-3209          F: 510-486-8615
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
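
The "gather once, reuse afterwards" behaviour Michael describes in the
quoted message can be sketched in plain shell roughly as follows.  This
is only an illustration of the caching idea, not NHC's actual code:
run_ps, load_ps, proc_running and the PS_* variables are hypothetical
names made up for the example, and the ps output columns were chosen
for convenience.

```shell
#!/bin/sh
# Illustrative sketch only; every name here is hypothetical, not NHC's.

PS_LOADED=0     # becomes 1 once the process table has been gathered
PS_CALLS=0      # counts how many times "ps" was actually launched
PS_DATA=""      # cached "user pid command" lines

# The expensive subprocess call, kept in one place.
run_ps() {
    ps -e -o user= -o pid= -o comm=
}

# Gather the process table on first use; later calls return immediately.
load_ps() {
    if [ "$PS_LOADED" -eq 0 ]; then
        PS_DATA="$(run_ps)"
        PS_LOADED=1
        PS_CALLS=$((PS_CALLS + 1))
    fi
}

# Succeed if a process named $1 exists, optionally owned by user $2.
# Matching is done against the cached data, never by re-running ps.
proc_running() {
    load_ps
    printf '%s\n' "$PS_DATA" | awk -v n="$1" -v o="${2:-}" \
        '$3 == n && (o == "" || $1 == o) { found = 1 } END { exit !found }'
}

proc_running sshd root && echo "sshd (root) is running"
proc_running crond     && echo "crond is running"
echo "ps was launched $PS_CALLS time(s)"
```

However many proc_running calls are made, the counter stays at 1: only
the first check pays for the subprocess, which is why adding more
ps-based checks to a configuration that already has one costs very
little extra time.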