[torqueusers] pbsnodes reporting incorrect "availmem"?

Riccardo Murri rmurri at cscs.ch
Tue Jun 2 04:20:59 MDT 2009


We just noticed that our cluster is running only a fraction of the
jobs that it could run.  We traced this to MAUI being convinced that
the worker nodes have much less virtual memory than they actually
do.  This in turn stems from "pbsnodes" reporting a strange
"availmem" value:

  $ pbsnodes -a
       state = free
       np = 16
       properties = lcgpro
       ntype = cluster
       jobs = [...]e01.lcg.cscs.ch, 4/1699839.ce01.lcg.cscs.ch
       status = opsys=linux,uname=Linux wn03 2.6.9-78.0.22.ELhugemem #1 SMP Fri May 1 00:50:13 CDT 2009 i686,[...],nsessions=5,nusers=2,idletime=336,totmem=43746664kb,availmem=14258792kb,physmem=33264260kb,ncpus=16,loadave=5.01,netload=3445635636,state=free,jobs=[...],varattr=,rectime=1243937534
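
To pull the memory figures out of that status line for comparison, a minimal Python sketch could look like the following. It is not part of TORQUE; the helper names parse_status and kb are made up for illustration, and the naive comma split assumes no attribute value contains a comma (which is why the elided "[...]" parts and the "jobs" attribute are left out of the sample string).

```python
def parse_status(status: str) -> dict:
    """Split a pbsnodes 'status' string of key=value pairs into a dict.

    Assumption: values contain no commas. That does not hold for a real
    'jobs=' attribute listing several jobs, so this is only a sketch.
    """
    fields = {}
    for item in status.split(","):
        key, sep, value = item.partition("=")
        if sep:  # skip fragments without '='
            fields[key.strip()] = value.strip()
    return fields


def kb(value: str) -> int:
    """Convert a value such as '14258792kb' to an integer number of kB."""
    if not value.endswith("kb"):
        raise ValueError(f"unexpected memory value: {value!r}")
    return int(value[:-2])


# Sample copied from the status line above, without the elided parts.
status = (
    "nsessions=5,nusers=2,idletime=336,totmem=43746664kb,"
    "availmem=14258792kb,physmem=33264260kb,ncpus=16,loadave=5.01,"
    "netload=3445635636,state=free,varattr=,rectime=1243937534"
)

fields = parse_status(status)
mem = {name: kb(fields[name]) for name in ("totmem", "availmem", "physmem")}

for name, value in mem.items():
    print(f"{name:>8} = {value:>10} kB")
```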

The availmem=14258792kb value does not obviously match anything that
system utilities such as "free" display:

  $ ssh wn03 free -k
               total       used       free     shared    buffers     cached
  Mem:      33264260   32314932     949328          0      17892    3503884
  -/+ buffers/cache:   28793156    4471104
  Swap:     10482404     729668    9752736

Output from "ps" and "pmap" utilities is consistent with what "free"
is displaying.
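
For reference, the figures quoted above can be put side by side with a few sums. This is only an arithmetic cross-check of the numbers in this message, written as a small Python sketch; the variable names are mine, and the result does not establish how pbs_mom actually computes "availmem". Note also that the two outputs were not sampled at the same instant.

```python
# All values in kB, copied from the "free -k" output above.
mem_total = 33264260
mem_free = 949328
buffers = 17892
cached = 3503884
swap_total = 10482404
swap_free = 9752736

# Values reported in the pbsnodes status line.
totmem = 43746664
availmem = 14258792
physmem = 33264260

# Candidate sums to compare against the pbsnodes figures.
total_incl_swap = mem_total + swap_total                    # RAM + swap
free_plus_swap = mem_free + swap_free                       # strictly free
reclaimable_plus_swap = mem_free + buffers + cached + swap_free

print("physmem matches Mem total:  ", physmem == mem_total)
print("totmem matches RAM + swap:  ", totmem == total_incl_swap)
print("free + swap free:           ", free_plus_swap, "kB")
print("free + buffers/cache + swap:", reclaimable_plus_swap, "kB")
print("availmem minus that sum:    ", availmem - reclaimable_plus_swap, "kB")
```

With these numbers, "totmem" equals physical memory plus swap exactly, and "availmem" lies within about 35 MB of free memory plus buffers/cache plus free swap. That may or may not be a coincidence.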

How does pbs_mom compute the "availmem" value?  What could be wrong here?

We're using TORQUE 2.3.0 from the gLite distribution on SL4 nodes:

  $ ssh wn01 rpm -qa | fgrep torque

Thank you very much for any suggestion!

Best regards,

Riccardo Murri
CSCS - Swiss National Centre for Supercomputing
Galleria 2, via Cantonale
CH-6928 Manno (Switzerland)

tel.: +41 91 610 8234
Fax: +41 91 610 8282
