[torqueusers] pbsnodes reporting incorrect totmem/availmem

Gianfranco Sciacca gs at hep.ucl.ac.uk
Thu Jun 11 06:46:37 MDT 2009


Hi everyone,

we are having scheduling issues on our cluster, presumably due to  
wrong values of totmem and availmem reported by pbsnodes. The affected  
nodes have 16GB RAM + 16GB swap; nodes with up to 4GB RAM + 4GB swap  
seem to report more or less consistent values.

I paste below an example, with farm00 being the Torque server and  
farm25 the node being probed. I cannot make much sense of the numbers  
reported, except by Maui, which seems to get it right. But perhaps  
it's just me not understanding the output of pbsnodes. I can't be  
certain what values were reported some weeks back, but scheduling made  
sense then: when jobs were submitted to the idle farm, the 16+16GB  
nodes were consistently prioritised as execution nodes over the less  
well-equipped nodes.

I should add that earlier today I rebooted several of these 16+16GB  
nodes (which picked up a new kernel), and availmem is now consistently  
about half of what was seen prior to the reboot. The node probed below  
has not been rebooted yet. The Torque and Maui RPMs are 32-bit builds.

Thanks for any enlightenment,
Gianfranco

=====================
[root at farm00 ~]# uname -a; rpm -qa|grep torque; rpm -qa|grep maui; pbsnodes farm25; ssh farm25 "uname -a; free -k"; checknode farm25
Linux farm00 2.6.9-78.0.8.EL.cernsmp #1 SMP Thu Nov 27 15:13:12 CET 2008 x86_64 x86_64 x86_64 GNU/Linux
torque-2.1.8-1cri_sl4_1st
torque-scheduler-2.1.8-1cri_sl4_1st
torque-client-2.1.8-1cri_sl4_1st
torque-server-2.1.8-1cri_sl4_1st
torque-mom-2.1.8-1cri_sl4_1st
torque-gui-2.1.8-1cri_sl4_1st
torque-docs-2.1.8-1cri_sl4_1st
maui-server-3.2.6p17-1_sl4
maui-3.2.6p17-1_sl4
maui-client-3.2.6p17-1_sl4
farm25
     state = offline
     np = 8
     properties = 64bit
     ntype = cluster
     jobs = 0/1906412.farm00
     status = opsys=linux,uname=Linux farm25 2.6.9-78.0.22.EL.cernsmp #1 SMP Mon May 4 17:21:38 CEST 2009 x86_64,sessions=13153,nsessions=1,nusers=1,idletime=2590913,totmem=3851164kb,availmem=7147176kb,physmem=16431408kb,ncpus=8,loadave=1.00,netload=4294967294,size=363311544kb:363412444kb,state=free,jobs=1906412.farm00,rectime=1244722984

root at farm25's password:
Linux farm25 2.6.9-78.0.22.EL.cernsmp #1 SMP Mon May 4 17:21:38 CEST 2009 x86_64 x86_64 x86_64 GNU/Linux
             total       used       free     shared    buffers     cached
Mem:      16431408    2232032   14199376          0     247692    1099352
-/+ buffers/cache:     884988   15546420
Swap:     16779884      14096   16765788


checking node farm25

State:   Drained  (in current state for 1:52:45)
Configured Resources: PROCS: 8  MEM: 15G  SWAP: 15G  DISK: 346G
Utilized   Resources: PROCS: 8  DISK: 98M
Dedicated  Resources: PROCS: 1  MEM: 1024M
Opsys:         linux  Arch:      [NONE]
Speed:      1.00  Load:       1.000
Network:    [DEFAULT]
Features:   [64bit]
Attributes: [Batch]
Classes:    [medium 8:8][long 7:8][short 8:8][parallel 0:8][bulk 0:8][medium64 8:8]

Total Time:   INFINITY  Up:   INFINITY (89.83%)  Active: 83:07:07:13 (59.82%)

Reservations:
  Job '1906412'(x1)  -2:08:30 -> 3:21:51:30 (4:00:00:00)
JobList:  1906412
ALERT:  jobs active on node but state is Drained
=====================

-- 
Dr. Gianfranco Sciacca			Tel: +44 (0)20 7679 3044
Dept of Physics and Astronomy		Internal: 33044
University College London		D15 - Physics Building
London WC1E 6BT


