[torqueusers] Retrieving node loadave

Neil Hodgson neil.hodgson at sirca.org.au
Sun Mar 8 20:23:50 MDT 2009


    To avoid some problems with the resource manager interface in a
site-specific scheduler, it would be better to retrieve the node load average
through a different mechanism. It appears that loadave is always reported in
the node "status" attribute (ATTR_NODE_status) returned by pbs_statnode and
visible in pbsnodes -a:

opsys=linux,uname=Linux castle-dev106.sirca.org.au 2.6.18-92.el5PAE #1 SMP Tue 
Jun 10 19:22:41 EDT 2008 
i686,sessions=11601,nsessions=1,nusers=1,idletime=244033,totmem=8342120kb,availmem=8192616kb,physmem=4147824kb,ncpus=8,loadave=0.00,netload=898153080,state=free,jobs=,varattr=,rectime=1236564365
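
    For reference, the sort of thing I have in mind is the sketch below, using
the pbs_ifl.h client calls (pbs_connect, pbs_statnode) and scanning the status
string for the loadave field. The node name is only an example and the parsing
is untested:

#include <stdio.h>
#include <string.h>
#include <pbs_ifl.h>
#include <pbs_error.h>

int main(void)
{
    struct batch_status *bs, *node;
    struct attrl *a;
    char *p;
    double la;
    int con;

    con = pbs_connect(NULL);    /* NULL = default server */
    if (con < 0) {
        fprintf(stderr, "pbs_connect failed, pbs_errno=%d\n", pbs_errno);
        return 1;
    }

    /* "castle-dev106" is only an example node name */
    bs = pbs_statnode(con, "castle-dev106", NULL, NULL);

    for (node = bs; node != NULL; node = node->next) {
        for (a = node->attribs; a != NULL; a = a->next) {
            if (strcmp(a->name, ATTR_NODE_status) != 0)
                continue;
            /* a->value is the comma-separated list shown above,
               e.g. "opsys=linux,...,loadave=0.00,..." */
            p = strstr(a->value, "loadave=");
            if (p != NULL && sscanf(p, "loadave=%lf", &la) == 1)
                printf("%s loadave %.2f\n", node->name, la);
        }
    }

    pbs_statfree(bs);
    pbs_disconnect(con);
    return 0;
}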

    I couldn't find any documentation that specifies what "status" contains or
how it is generated. From reading the code and experimenting, it appears the
loadave portion comes from a platform-specific get_la function but may be
overridden by an entry in the mom_priv/config file such as:

loadave           !/bin/awk '{print $2}' /proc/loadavg

    If the command in the config fails (as happened when I copied the above
command onto an OS X machine, which has no /proc/loadavg), then status will not
include loadave.
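
    On OS X I would presumably need to point the config entry at something
that exists there; an untested possibility (the wrapper script path is just an
example) is:

loadave           !/usr/local/sbin/loadave.sh

where /usr/local/sbin/loadave.sh is roughly:

#!/bin/sh
# Print the 1-minute load average; "sysctl -n vm.loadavg" prints
# something like "{ 0.52 0.54 0.56 }" on OS X.
/usr/sbin/sysctl -n vm.loadavg | /usr/bin/awk '{print $2}'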

    Is my understanding above correct, and can I rely on loadave being present
inside "status" unless there is a bad configuration?

    Background:

    The scheduler uses the resource manager calls (openrm, addreq, getreq,
closerm) to retrieve static and dynamic resources defined in the node's
mom_priv/config file. This occasionally fails, and the failures occur more
often when the network is busy. The network is known to have a high rate of
UDP packet loss: syslog entries forwarded to a central monitor over UDP
frequently go missing. After examining the failures and the code, I believe
(but have no real proof) that RPP is not completely reliable when packets are
dropped; it gets stuck and then times out.
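
    For context, the pattern in the scheduler is roughly the following. This
is a stripped-down sketch: the prototypes come from rm.h in the Torque source,
the host name is only an example, and the exact reply format is from memory,
so treat it as illustrative rather than exact:

#include <stdio.h>
#include <string.h>

/* Prototypes normally come from rm.h in the Torque source tree. */
extern int   openrm(char *host, unsigned int port);
extern int   addreq(int stream, char *line);
extern char *getreq(int stream);
extern int   closerm(int stream);

int main(void)
{
    int rm;
    char *resp, *eq;

    rm = openrm("castle-dev106", 0);   /* 0 should select the default mom RM port */
    if (rm < 0) {
        fprintf(stderr, "openrm failed\n");
        return 1;
    }

    if (addreq(rm, "loadave") != 0) {
        fprintf(stderr, "addreq failed\n");
        closerm(rm);
        return 1;
    }

    resp = getreq(rm);   /* reply is expected to look like "loadave=0.05" */
    if (resp == NULL) {
        /* this is where the scheduler intermittently fails on a busy network */
        fprintf(stderr, "getreq failed\n");
    } else {
        eq = strchr(resp, '=');
        printf("loadave is %s\n", eq != NULL ? eq + 1 : resp);
    }

    closerm(rm);
    return 0;
}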

    Neil

