[torqueusers] Retrieving node loadave

Neil Hodgson neil.hodgson at sirca.org.au
Sun Mar 8 20:23:50 MDT 2009

    To avoid some problems with using the resource manager calls in a 
site-specific scheduler, I would like to retrieve the node load average through 
a different mechanism. It appears that loadave is always reported in the node 
"status" attribute (ATTR_NODE_status), which is returned by pbs_statnode and 
visible in pbsnodes -a:

opsys=linux,uname=Linux castle-dev106.sirca.org.au 2.6.18-92.el5PAE #1 SMP Tue 
Jun 10 19:22:41 EDT 2008 
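
    As a rough sketch of how a scheduler might pull loadave out of that 
attribute, the Python below assumes that "status" is a comma-separated list of 
key=value pairs (as the truncated sample above suggests) and that a loadave=... 
pair appears among them. The function names and the sample string are 
hypothetical, and the rule for values containing commas is a guess rather than 
documented TORQUE behaviour.

```python
def parse_status(status):
    """Split a node "status" string into a dict of key -> value.

    Assumption: fields look like key=value and are separated by commas.
    """
    fields = {}
    last_key = None
    for part in status.split(","):
        key, sep, value = part.partition("=")
        if sep:
            last_key = key.strip()
            fields[last_key] = value
        elif last_key is not None:
            # No "=" in this fragment: assume the previous value contained
            # a comma and glue the fragment back onto it.
            fields[last_key] += "," + part
    return fields


def node_loadave(status):
    """Return loadave as a float, or None if it is missing or not numeric."""
    raw = parse_status(status).get("loadave")
    if raw is None:
        return None
    try:
        return float(raw)
    except ValueError:
        return None


# Hypothetical status string; real values depend on the node.
sample = "opsys=linux,uname=Linux node01 2.6.18 #1 SMP,loadave=0.25,ncpus=2"
print(node_loadave(sample))                  # 0.25
print(node_loadave("opsys=darwin,ncpus=2"))  # None
```

    Returning None rather than raising lets the caller decide what to do with 
a node whose status lacks loadave.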

    I couldn't find any documentation that specifies what "status" contains or 
how it is generated. From reading the code and experimenting, I gather that the 
loadave portion comes from a platform-specific get_la function but may be 
overridden by an entry in the mom_priv/config file like:

loadave           !/bin/awk '{print $2}' /proc/loadavg
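
    For reference, /proc/loadavg on Linux holds the 1, 5 and 15 minute load 
averages in its first three fields, so the awk command above prints the 
5 minute figure. The sketch below shows the same field selection in Python, 
with os.getloadavg() as one possible alternative on systems that have no 
/proc/loadavg. The function names and the sample line are made up for 
illustration and are not part of TORQUE.

```python
import os


def five_minute_load(loadavg_line):
    """Return the second field of a /proc/loadavg-style line as a float.

    This mirrors awk '{print $2}': fields are split on whitespace.
    """
    return float(loadavg_line.split()[1])


def system_five_minute_load():
    """Ask the OS directly instead of reading /proc/loadavg.

    os.getloadavg() returns (1 min, 5 min, 15 min) on Unix-like systems
    and raises OSError if the load average cannot be obtained.
    """
    return os.getloadavg()[1]


# Sample line in the Linux /proc/loadavg layout; the values are invented.
line = "0.20 0.18 0.12 1/80 11206"
print(five_minute_load(line))  # 0.18
```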

    If the command in the config fails (as occurred when I copied the above 
command onto an OS X machine) then status will not include loadave.

    Is my understanding above correct, and can I rely on loadave being present 
inside "status" unless there is a bad configuration?


    The scheduler uses the resource manager calls (openrm, addreq, getreq, 
closerm) to retrieve static and dynamic resources from the node's 
mom_priv/config file. This occasionally fails, and the failures occur more often 
when the network is busy. The network is known to have a high failure rate for 
UDP: syslogs are forwarded to a central monitor over UDP and entries are often 
missing. After examining the failures and the code, I believe (but have no real 
proof) that RPP is not completely reliable when packets are dropped: it gets 
stuck and then times out.


