[torqueusers] Warewulf NHC 1.2.2 Release

Michael Jennings mej at lbl.gov
Thu Jan 24 13:51:22 MST 2013

On Thursday, 24 January 2013, at 15:26:14 (-0500),
Matt Britt wrote:

> Thanks Michael - that got me pointed in the right direction.  We're
> just using /etc/passwd, and it should be up to date.  The function
> using the time was 'check_ps_daemon sshd root':
> [root@nyx5506 msbritt]# time nhc        (with check_ps_daemon)
> real    0m5.785s
> user    0m5.565s
> sys     0m0.101s
> [root@nyx5506 msbritt]# !vim
> vim /etc/nhc/nhc.conf
> [root@nyx5506 msbritt]# time nhc  (without check_ps_daemon)
> real    0m0.185s
> user    0m0.109s
> sys     0m0.055s

Wow, that's quite a difference.  :-)

Is that the only check_ps_* check in your configuration?  I'm guessing
it is, based on the time delay.

What happens is this:  the first time you use one of the process-based
checks, NHC will run the "ps" command to gather information on all
your system processes.  This can, as you're seeing, take quite a bit
of time on a heavily-loaded compute node.  However, it only needs to
do this once; if you use one ps-based check, you can use as many as
you want because you've already "taken the hit" of the subprocess
overhead.  Subsequent checks will use the cached data instead of
launching "ps" again.
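
The run-once-then-reuse pattern described above can be sketched
roughly as follows.  This is a simplified illustration, not NHC's
actual code:  the names load_ps_cache, sketch_check_daemon, PS_CACHE,
PS_CACHE_LOADED, and PS_RUNS are all made up for the example, and the
real checks gather and match more fields than this.

```shell
# Hypothetical sketch of lazy "ps" caching; not NHC's real implementation.

PS_CACHE=""          # raw "user command" lines, one per process
PS_CACHE_LOADED=0    # set to 1 once ps has been run
PS_RUNS=0            # counts how many times ps was actually launched

# Run ps only if the cache has not been filled yet.
load_ps_cache() {
    if [ "$PS_CACHE_LOADED" -eq 1 ]; then
        return 0
    fi
    # "user=" and "comm=" suppress the header line (POSIX ps syntax).
    PS_CACHE=$(ps -e -o user= -o comm=)
    PS_CACHE_LOADED=1
    PS_RUNS=$((PS_RUNS + 1))
}

# Succeed if a process named $1 owned by user $2 appears in the cache.
sketch_check_daemon() {
    load_ps_cache
    # The here-document keeps the loop in the current shell, so
    # "return" leaves the function rather than a subshell.
    while read -r _user _cmd; do
        if [ "$_cmd" = "$1" ] && [ "$_user" = "$2" ]; then
            return 0
        fi
    done <<EOF
$PS_CACHE
EOF
    return 1
}
```

With these definitions, calling "sketch_check_daemon sshd root" and
then "sketch_check_daemon crond root" launches ps only once; PS_RUNS
is still 1 after the second call, because the second check reads
PS_CACHE instead of starting another subprocess.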

Glad you found the culprit!  NHC tries to be as efficient as possible
in everything it does, but it's up to each site to determine how they
want to balance the tradeoffs between longer/shorter execution time
for NHC and more/less comprehensive assessments of node health.  I
tried to make it as easy as possible to measure and evaluate those
tradeoffs; hopefully I succeeded.  :-)


Michael Jennings <mej at lbl.gov>
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615
