[torqueusers] Torque on NUMA Systems

Alison Barros alisonbarros at gmail.com
Mon Sep 16 08:19:22 MDT 2013


Hello Guys,

I have an SGI ICEX cluster running Torque perfectly, and I am now
responsible for deploying Torque on an SGI UV2000 system with a NUMA
configuration on SLES 11, but I'm having some trouble. I hope somebody can
help me.

* Hardware specification:

The topology command reports:

System type: UV2000
System name: lanina
Serial number: UV2-00000003
Partition number: 0
48 Blades
1536 CPUs
96 Nodes
2933.29 GB Memory Total
31.00 GB Max Memory on any Node
1 BASE I/O Riser
2 PCIe Slots
2 Fibre Channel Controllers
1 InfiniBand Controller
2 Network Controllers
2 Storage Controllers
2 USB Controllers
1 VGA GPU


Although the topology command reports 1536 CPUs, there are only 768 with
Hyper-Threading enabled.


* Compile step:

I built the RPM packages after editing torque.spec to add the
--enable-numa-support and --enable-cpuset flags.

After installing the packages on the system, I verified the flags with the
pbs_server --about command:


lanina:~ # pbs_server --about
Package: torque 4.2.3.1
Sourcedir: /usr/src/packages/BUILD/torque-4.2.3.1
Configure: '--host=x86_64-suse-linux-gnu' '--build=x86_64-suse-linux-gnu'
'--target=x86_64-suse-linux' '--program-prefix=' '--prefix=/usr'
'--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin'
'--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include'
'--libdir=/usr/lib64' '--libexecdir=/usr/lib64' '--localstatedir=/var'
'--sharedstatedir=/usr/com' '--mandir=/usr/share/man'
'--infodir=/usr/share/info' '--includedir=/usr/include/torque'
'--with-default-server=lanina' '--with-server-home=/var/spool/torque'
'--without-debug' 'CFLAGS=-O0 -g3' '--disable-libcpuset'
'--with-sendmail=/usr/sbin/sendmail' '--enable-numa-support'
'--enable-memacct'
'--disable-top-tempdir-only' '--disable-dependency-tracking'
'--disable-gui' '--without-tcl' '--with-rcp=scp' '--enable-syslog'
'--disable-gcc-warnings' '--disable-munge-auth' '--without-pam'
'--disable-drmaa' '--disable-qsub-keep-override' '--disable-blcr'
'--enable-cpuset'
'--enable-spool'
'--with-hwloc-path=/usr/include/hwloc' 'build_alias=x86_64-suse-linux-gnu'
'host_alias=x86_64-suse-linux-gnu'
'target_alias=x86_64-suse-linux' 'CXXFLAGS=-O2 -g -m64 -fmessage-length=0
-D_FORTIFY_SOURCE=2 -fstack-protector -funwind-tables
-fasynchronous-unwind-tables'
Buildcflags: -O0 -g3 -DNUMA_SUPPORT -I/usr/include/hwloc/include

* Configuration:

Following the manual, I used the /sys/devices/system/node directory to
create the /var/spool/mom_priv/mom.layout file on the client side:

cpus=0-7 mem=0
cpus=8-15 mem=1
cpus=16-23 mem=2
cpus=24-31 mem=3
cpus=32-39 mem=4
...
cpus=744-751 mem=93
cpus=752-759 mem=94
cpus=760-767 mem=95
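
For reference, a minimal sketch of how layout lines in the format shown
above could be generated from sysfs instead of being typed by hand. It
assumes each NUMA node appears as a nodeN directory containing a cpulist
file; the function names read_node_cpulists and layout_lines are
hypothetical helpers, not part of Torque.

```python
import re
from pathlib import Path


def read_node_cpulists(root="/sys/devices/system/node"):
    """Return {node_id: cpulist string} for every nodeN directory under root."""
    nodes = {}
    for entry in Path(root).iterdir():
        match = re.fullmatch(r"node(\d+)", entry.name)
        cpulist = entry / "cpulist"
        if match and cpulist.is_file():
            nodes[int(match.group(1))] = cpulist.read_text().strip()
    return nodes


def layout_lines(cpulists):
    """Format one 'cpus=<list> mem=<node>' line per node, in numeric node order."""
    return [f"cpus={cpulists[n]} mem={n}" for n in sorted(cpulists)]
```

Calling layout_lines(read_node_cpulists()) on the target machine yields the
lines to write to the layout file. The cpulist is copied verbatim, so
whatever the kernel reports for a node, including any hyper-thread
siblings, ends up in that node's line.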

On the server side, I created the /var/spool/torque/server_priv/nodes file
with the following content:

lanina np=768 num_numa_nodes=96
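
A quick consistency check between the two files can be sketched as
follows: 768 CPUs over 96 nodes is 8 CPUs per node, which matches the
cpus=0-7 style entries. The sketch assumes the layout uses the
"cpus=... mem=..." form shown above; the helper names are hypothetical
and nothing here is part of Torque.

```python
import re


def expand_cpus(spec):
    """Expand a cpulist such as '0-7' or '0-3,8-11' into a sorted list of CPU ids."""
    cpus = set()
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)


def summarize_layout(text):
    """Return (number of layout lines, number of distinct CPUs) for layout text."""
    boards, cpus = 0, set()
    for line in text.splitlines():
        match = re.match(r"\s*cpus=(\S+)\s+mem=(\S+)", line)
        if match:  # blank lines and anything else are skipped
            boards += 1
            cpus.update(expand_cpus(match.group(1)))
    return boards, len(cpus)


def parse_nodes_entry(line):
    """Return (np, num_numa_nodes) as integers from one nodes-file line."""
    fields = dict(f.split("=", 1) for f in line.split()[1:] if "=" in f)
    return int(fields["np"]), int(fields["num_numa_nodes"])
```

If summarize_layout returns (boards, cpus) and parse_nodes_entry returns
(np, numa), the two files agree when boards == numa and cpus == np.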

* Results:

The server log, when started in debug mode, shows:

09/13/2013 10:55:44;0002;PBS_Server.588556;Svr;Log;Log opened
09/13/2013 10:55:44;0006;PBS_Server.588556;Svr;PBS_Server;Server 'lanina' started, initialization type = 1
09/13/2013 10:55:44;0002;PBS_Server.588556;Svr;get_default_threads;Defaulting min_threads to 3073 threads
09/13/2013 10:55:44;0002;PBS_Server.588556;Svr;Act;Account file /var/spool/torque/server_priv/accounting/20130913 opened
09/13/2013 10:55:44;0040;PBS_Server.588556;Req;setup_nodes;setup_nodes()
09/13/2013 10:55:44;0040;PBS_Server.588556;Req;setup_nodes;could not create node "lanina", error = 15002
09/13/2013 10:55:44;0086;PBS_Server.588556;Svr;PBS_Server;Recovered queue batch
09/13/2013 10:55:44;0086;PBS_Server.588556;Svr;PBS_Server;Recovered queue pesquisa
09/13/2013 10:55:44;0086;PBS_Server.588556;Svr;PBS_Server;Recovered queue operacional
09/13/2013 10:55:44;0002;PBS_Server.588556;Svr;PBS_Server;Expected 3, recovered 3 queues
09/13/2013 10:55:44;0080;PBS_Server.588556;Svr;PBS_Server;2 total files read from disk
09/13/2013 10:55:44;0002;PBS_Server.588556;Svr;PBS_Server;handle_job_recovery:3
09/13/2013 10:55:44;0006;PBS_Server.588556;Svr;PBS_Server;Using ports Server:15001 Scheduler:15004 MOM:15002 (server: 'lanina')
09/13/2013 10:55:44;0002;PBS_Server.588556;Svr;PBS_Server;Server Ready, pid = 588556, loglevel=0
09/13/2013 10:55:55;0002;PBS_Server.588560;Svr;PBS_Server;Torque Server Version = 4.2.3.1, loglevel = 0


The client (pbs_mom) log, when started in debug mode, shows:

09/13/2013 10:55:49;0002; pbs_mom.588562;Svr;Log;Log opened
09/13/2013 10:55:49;0002; pbs_mom.588562;Svr;pbs_mom;Torque Mom Version = 4.2.3.1, loglevel = 0
09/13/2013 10:55:50;0002; pbs_mom.588562;Svr;setup_program_environment;machine topology contains 96 memory nodes, 1536 cpus
09/13/2013 10:55:50;0001; pbs_mom.588562;Svr;pbs_mom;LOG_ERROR::read_layout_file, nodeboard 0 has no nodeset
09/13/2013 10:55:50;0001; pbs_mom.588562;Svr;pbs_mom;LOG_ERROR::setup_nodeboards, Could not read layout file!
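
To pull the error entries out of logs like the ones above, a small parser
can be sketched. It assumes every record is a single unwrapped line with
six semicolon-separated fields; the field names in LogRecord are my own
labels for illustration, not official Torque terminology.

```python
from typing import NamedTuple


class LogRecord(NamedTuple):
    timestamp: str
    event: str
    daemon: str
    event_class: str
    name: str
    message: str


def parse_log_line(line):
    """Split one unwrapped log line into its six semicolon-separated fields."""
    # maxsplit=5 keeps any semicolons inside the message text intact
    parts = [p.strip() for p in line.split(";", 5)]
    if len(parts) != 6:
        raise ValueError(f"not a complete log line: {line!r}")
    return LogRecord(*parts)


def error_messages(lines):
    """Return the messages of records whose text starts with LOG_ERROR."""
    records = (parse_log_line(l) for l in lines if l.strip())
    return [r.message for r in records if r.message.startswith("LOG_ERROR")]
```

Applied to the pbs_mom log, error_messages returns the read_layout_file
and setup_nodeboards entries, which are the two lines worth searching for
in the source or the documentation.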


It looks as if Torque cannot find the mom.layout file, but when I start
the Torque client under strace I can see it opening and reading the file.

Any help would be appreciated. Thanks in advance.

-- 
Regards,
MSc. Alison Barros da Silva