[torqueusers] pbsnodes reporting incorrect totmem/availmem
Gianfranco Sciacca
gs at hep.ucl.ac.uk
Thu Jun 11 06:46:37 MDT 2009
Hi everyone,
We are having scheduling issues on our cluster, presumably due to
wrong values of totmem and availmem reported by pbsnodes. The affected
nodes have 16GB RAM + 16GB swap. Other nodes, with up to 4GB RAM + 4GB
swap, seem to report more or less consistent values.
Below I paste an example, with farm00 being the Torque server and
farm25 the node being probed. I cannot make much sense of the numbers
reported, except for those from Maui, which seems to get it right. But
perhaps I am just misreading the output of pbsnodes. I can't
be positive about what values were reported some weeks back, but
scheduling made sense then: if jobs were submitted to the idle farm,
the 16+16GB nodes were clearly prioritised as execution nodes over the
less well-equipped nodes.
I should add that I recently rebooted several of these 16+16GB
nodes (which installed a new kernel), and their availmem value is now
consistently about half of what was seen prior to the reboot. The node
probed below has not been rebooted yet. The Torque and Maui RPMs are 32-bit.
Thanks for any enlightenment,
Gianfranco
=====================
[root@farm00 ~]# uname -a; rpm -qa|grep torque; rpm -qa|grep maui; pbsnodes farm25; ssh farm25 "uname -a; free -k"; checknode farm25;
Linux farm00 2.6.9-78.0.8.EL.cernsmp #1 SMP Thu Nov 27 15:13:12 CET 2008 x86_64 x86_64 x86_64 GNU/Linux
torque-2.1.8-1cri_sl4_1st
torque-scheduler-2.1.8-1cri_sl4_1st
torque-client-2.1.8-1cri_sl4_1st
torque-server-2.1.8-1cri_sl4_1st
torque-mom-2.1.8-1cri_sl4_1st
torque-gui-2.1.8-1cri_sl4_1st
torque-docs-2.1.8-1cri_sl4_1st
maui-server-3.2.6p17-1_sl4
maui-3.2.6p17-1_sl4
maui-client-3.2.6p17-1_sl4
farm25
state = offline
np = 8
properties = 64bit
ntype = cluster
jobs = 0/1906412.farm00
status = opsys=linux,uname=Linux farm25 2.6.9-78.0.22.EL.cernsmp #1 SMP Mon May 4 17:21:38 CEST 2009 x86_64,sessions=13153,nsessions=1,nusers=1,idletime=2590913,totmem=3851164kb,availmem=7147176kb,physmem=16431408kb,ncpus=8,loadave=1.00,netload=4294967294,size=363311544kb:363412444kb,state=free,jobs=1906412.farm00,rectime=1244722984
root@farm25's password:
Linux farm25 2.6.9-78.0.22.EL.cernsmp #1 SMP Mon May 4 17:21:38 CEST 2009 x86_64 x86_64 x86_64 GNU/Linux
             total       used       free     shared    buffers     cached
Mem:      16431408    2232032   14199376          0     247692    1099352
-/+ buffers/cache:     884988   15546420
Swap:     16779884      14096   16765788
checking node farm25
State: Drained (in current state for 1:52:45)
Configured Resources: PROCS: 8 MEM: 15G SWAP: 15G DISK: 346G
Utilized Resources: PROCS: 8 DISK: 98M
Dedicated Resources: PROCS: 1 MEM: 1024M
Opsys: linux Arch: [NONE]
Speed: 1.00 Load: 1.000
Network: [DEFAULT]
Features: [64bit]
Attributes: [Batch]
Classes: [medium 8:8][long 7:8][short 8:8][parallel 0:8][bulk 0:8][medium64 8:8]
Total Time: INFINITY Up: INFINITY (89.83%) Active: 83:07:07:13 (59.82%)
Reservations:
Job '1906412'(x1) -2:08:30 -> 3:21:51:30 (4:00:00:00)
JobList: 1906412
ALERT: jobs active on node but state is Drained
=====================
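
A quick arithmetic check on the figures above, offered as a hypothesis only. It assumes that totmem is meant to be physical memory plus swap, and that the byte count passes through an unsigned 32-bit integer somewhere (the Torque RPMs here are 32-bit). Neither assumption is confirmed by the output, and the helper name wrap_32bit_bytes is made up for this illustration.

```python
# Values copied from the pbsnodes and `free -k` output above (all in kB).
PHYSMEM_KB = 16431408          # physmem from pbsnodes, "Mem: total" from free
SWAP_KB = 16779884             # "Swap: total" from free
REPORTED_TOTMEM_KB = 3851164   # totmem from pbsnodes

# ASSUMPTION: totmem should be physical memory plus swap, and the byte
# count is stored in an unsigned 32-bit integer at some point.
WRAP_KB = 2**32 // 1024        # 4 GiB expressed in kB


def wrap_32bit_bytes(kb: int) -> int:
    """Return kb as it would read after its byte count wraps at 2**32."""
    return kb % WRAP_KB


expected_kb = PHYSMEM_KB + SWAP_KB
wrapped_kb = wrap_32bit_bytes(expected_kb)

print(f"physmem + swap      : {expected_kb} kB")
print(f"wrapped at 2**32 B  : {wrapped_kb} kB")
print(f"pbsnodes totmem     : {REPORTED_TOTMEM_KB} kB")
print("match" if wrapped_kb == REPORTED_TOTMEM_KB else "no match")
```

Under those assumptions the wrapped sum equals the reported totmem. The availmem value (7147176kb) is larger than 4 GiB, so a single wrap of this kind cannot account for it.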
--
Dr. Gianfranco Sciacca Tel: +44 (0)20 7679 3044
Dept of Physics and Astronomy Internal: 33044
University College London D15 - Physics Building
London WC1E 6BT