[torqueusers] Retrieving node loadave
Neil Hodgson
neil.hodgson at sirca.org.au
Sun Mar 8 20:23:50 MDT 2009
To avoid some problems with using the resource manager in a site-specific
scheduler, it would be better to retrieve the node load average through a
different mechanism. It appears that loadave is always reported in the node
"status" attribute (ATTR_NODE_status) returned by pbs_statnode and visible in
pbsnodes -a:
opsys=linux,uname=Linux castle-dev106.sirca.org.au 2.6.18-92.el5PAE #1 SMP Tue Jun 10 19:22:41 EDT 2008 i686,sessions=11601,nsessions=1,nusers=1,idletime=244033,totmem=8342120kb,availmem=8192616kb,physmem=4147824kb,ncpus=8,loadave=0.00,netload=898153080,state=free,jobs=,varattr=,rectime=1236564365
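For illustration, here is a minimal sketch of how a scheduler might pull loadave out of a status string like the one above. It assumes the string is a flat comma-separated list of key=value pairs and that no value contains a comma, which holds for this sample but is not something I have seen documented. The function names parse_status and node_loadave are made up for this example and are not part of TORQUE.

```python
from typing import Dict, Optional


def parse_status(status: str) -> Dict[str, str]:
    """Split a 'key=value,key=value' status string into a dict.

    Assumption: values never contain commas. Fragments without '='
    are skipped rather than treated as errors.
    """
    fields: Dict[str, str] = {}
    for item in status.split(","):
        key, sep, value = item.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def node_loadave(status: str) -> Optional[float]:
    """Return loadave as a float, or None if it is missing or malformed."""
    raw = parse_status(status).get("loadave")
    if raw is None:
        return None
    try:
        return float(raw)
    except ValueError:
        return None


# Shortened copy of the sample status string shown above.
SAMPLE = (
    "opsys=linux,ncpus=8,loadave=0.00,netload=898153080,"
    "state=free,jobs=,varattr=,rectime=1236564365"
)
```

Returning None instead of raising lets the caller decide how to treat a node whose status carries no loadave entry.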
I couldn't find any documentation that specifies what "status" contains or
how it is generated. From reading the code and experimenting, the loadave
portion comes from a platform-specific get_la function, but it may be
overridden by an entry in the mom_priv/config file such as:
loadave !/bin/awk '{print $2}' /proc/loadavg
If the command in the config fails (as occurred when I copied the above
command onto an OS X machine) then status will not include loadave.
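The awk command above prints the second field of /proc/loadavg, which on Linux is the 5-minute load average. /proc/loadavg does not exist on OS X, so the command fails there. A portable replacement could be sketched with Python's os.getloadavg(), which is available on both Linux and OS X. This is only a sketch: the function name five_minute_load is made up for this example, and I have not tested how pbs_mom treats a script like this when it prints nothing.

```python
import os
from typing import Optional


def five_minute_load() -> Optional[float]:
    """Return the 5-minute load average, or None if it is unavailable.

    os.getloadavg() returns the (1, 5, 15) minute averages, so index 1
    matches the second field of /proc/loadavg on Linux.
    """
    try:
        return os.getloadavg()[1]
    except (OSError, AttributeError):
        # OSError: the load average could not be obtained.
        # AttributeError: the platform does not provide getloadavg at all.
        return None


if __name__ == "__main__":
    load = five_minute_load()
    # Print only when a value is available; output is empty otherwise.
    if load is not None:
        print(f"{load:.2f}")
```

Saved as a script, this could stand in for the awk command in the loadave config entry, assuming the entry accepts any shell command after the "!".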
Is my understanding above correct, and can I rely on loadave being present
inside "status" unless the configuration is bad?
Background:
The scheduler uses the resource manager calls (openrm, addreq, getreq,
closerm) to retrieve static and dynamic resources from the node's
mom_priv/config file. This occasionally fails and the failures occur more often
when the network is busy. The network is known to have a high failure rate for
UDP - syslogs are forwarded to a central monitor over UDP and there are often
missing entries. After examining the failures and the code, I believe (but have
no real proof) that RPP is not completely reliable when packets are dropped: it
gets stuck and then times out.
Neil