[torqueusers] pbsnodes reporting incorrect totmem/availmem
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Thu Jun 11 06:59:18 MDT 2009
Gianfranco Sciacca wrote:
> we are having scheduling issues on our cluster, presumably due to
> wrong values of totmem and availmem reported by pbsnodes. Affected are
> nodes with 16 GB RAM + 16 GB swap. Other nodes with up to 4 GB RAM + 4 GB
> swap seem to report more or less consistent values.
> I paste below an example with farm00 being the Torque server and
> farm25 the node being probed. I am not sure I can make any sense of
> the numbers reported, except for Maui that seems to get it right. But
> perhaps it's just me not understanding the output of pbsnodes. I can't
> be positive about what values were reported some weeks back, but
> scheduling made sense in that if jobs were submitted to the idle farm,
> the 16+16GB nodes were surely prioritised as execution nodes over the
> less equipped nodes.
Some comments that may or may not be useful to you:
1. Your Torque version 2.1.8 is quite old; we use 2.1.11.
2. "pbsnodes -a" gives correct and consistent values for physical
memory and available memory (=physical+swap) in our cluster.
We have nodes with 8-24 GB RAM and 12-16 GB swap, and we run
CentOS 4 and 5 nodes.
3. May I recommend my script "pestat" for a quick overview of the nodes'
   load and memory usage, based on parsing the output of "pbsnodes -a"?
   Download ftp://ftp.fysik.dtu.dk/pub/Torque/pestat.
   (A rough sketch of the same kind of parsing follows the sample output below.)
A sample output shows pmem (physical memory), mem (physical memory + swap),
and other useful information about jobs and cluster nodes:
node  state  load   pmem ncpu    mem  resi usrs tasks  jobids/users
m035  down*  0.00   7990    4  23992   148  0/0     0
m036  free   0.00   7990    4  23992   170  0/0     0
m037  free   0.00   7990    4  23992   146  0/0     0
m038  excl   4.00   7990    4  23992  1269  1/1     4  179 dulak
m039  down*  0.00      0    0      0     0  0/0     0
m040  excl   3.99   7990    4  23992  1172  1/1     4  178 dulak
a001  free   0.00   3940    8  33937   269  0/0     0
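If you want to check the raw numbers yourself, here is a minimal Python sketch
(not the pestat script itself, just the same idea) that reads "pbsnodes -a" on
stdin and prints physmem, totmem and availmem per node. It assumes the usual
physmem/totmem/availmem fields in the status= line as reported by pbs_mom, and
the file name check_node_mem.py is only for illustration:

#!/usr/bin/env python
# Minimal sketch: sanity-check memory values from "pbsnodes -a".
# Usage:  pbsnodes -a | python check_node_mem.py
# Assumes the status= line contains physmem, totmem and availmem (in kb),
# as pbs_mom normally reports them; exact fields may vary by Torque version.

import sys

def kb(value):
    # Convert a value like "16439652kb" to integer kilobytes.
    return int(value.rstrip("kb")) if value else 0

nodes = {}
current = None
for line in sys.stdin:
    if line.strip() and not line[0].isspace():
        # An unindented line starts a new node record.
        current = line.strip()
        nodes[current] = {}
    elif current is not None and line.strip().startswith("status ="):
        # The status line is a comma-separated list of key=value pairs.
        for item in line.split("=", 1)[1].split(","):
            key, _, value = item.strip().partition("=")
            nodes[current][key] = value

for node in sorted(nodes):
    phys  = kb(nodes[node].get("physmem", ""))
    tot   = kb(nodes[node].get("totmem", ""))
    avail = kb(nodes[node].get("availmem", ""))
    # totmem should be roughly physmem + swap, so tot - phys approximates swap.
    print("%-10s physmem=%8d kb  totmem=%8d kb  availmem=%8d kb  swap~=%8d kb"
          % (node, phys, tot, avail, tot - phys))

Running this on each node's output should make it obvious whether the 16+16 GB
nodes report a totmem that disagrees with the installed RAM plus swap.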
Ole Holm Nielsen
Department of Physics, Technical University of Denmark