[torqueusers] NUMA -- A first try

Svancara, Randall rsvancara at wsu.edu
Wed Apr 18 17:29:59 MDT 2012


Just a heads up: the reason NUMA was not being built into the RPM is that the buildutils/torque.spec.in file does not include the %{ac_with_numa} parameter in its ./configure section. Otherwise a regular (non-RPM) build works fine.

187c184
<     --with-server-home=%{torque_home} --with-sendmail=%{sendmail_path} %{ac_with_numa} \
---
>     --with-server-home=%{torque_home} --with-sendmail=%{sendmail_path} \

I couldn't figure it out exactly, as I am still learning about spec files, but these lines may also need to change:


19c19
< #%bcond_with    blcr
---
> %bcond_with    blcr
25c25
< #%bcond_with    numa
---
> %bcond_with    numa
35,37d34
< %bcond_without bclr
< %bcond_without numa
<

I am not sure if the ./configure overrides these values or not.
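For anyone else hitting this, the interaction is roughly as follows (a sketch under standard RPM %bcond semantics, not the literal torque.spec.in): `%bcond_with numa` leaves NUMA off by default but lets you enable it with `rpmbuild --with numa`, while `%bcond_without numa` makes it on by default. Either way, the spec still has to expand the resulting conditional into a configure flag, which is what the missing `%{ac_with_numa}` above would do:

```
# Sketch only -- the macro names ac_with_numa / torque_home come from the
# diff above; the exact wiring in torque.spec.in may differ.
%bcond_with numa    # off by default; enable with: rpmbuild --with numa

%if %{with numa}
%define ac_with_numa --enable-numa-support
%else
%define ac_with_numa %{nil}
%endif

./configure --with-server-home=%{torque_home} %{ac_with_numa} \
    ...
```

As far as I understand it, flags you pass to ./configure on your own command line before `make rpm` do not automatically carry over into the spec's configure invocation, which may explain the behavior.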

Randall Svancara
High Performance Computing Systems Administrator
Washington State University
509-335-3039

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
Sent: Wednesday, April 18, 2012 2:51 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] NUMA -- A first try

There are no additional libraries that you need to supply.
On Wed, Apr 18, 2012 at 3:46 PM, Svancara, Randall <rsvancara at wsu.edu> wrote:
OK, well this gives me a starting place at least.

Are there additional libraries I need to supply?

I build the software the following way:

./configure --prefix=/usr --with-blcr=/usr --enable-numa-support --disable-gui --enable-blcr --with-default-server=mgt1 --with-servchkptdir=/fastscratch/tmp --with-tmpdir=/fastscratch/tmp
make rpm

I believe this is the relevant section of config.log:

configure:22026: $? = 0
configure:22029: test -s conftest.o
configure:22032: $? = 0
configure:22043: result: yes
configure:22257: checking whether to allow geometry requests
configure:22274: result: no
configure:22285: checking whether to support NUMA systems
configure:22288: result: yes
configure:22313: checking whether to enable libcpuset support
configure:22399: result: no
configure:22407: checking whether to enable memacct support
configure:22416: result: no
configure:22510: checking whether add memory alignment flags
configure:22517: result: no
configure:22604: checking whether to build BLCR support
configure:22606: result: yes
configure:22627: checking for cr_init in -lcr
configure:22657: gcc -o conftest -g -O2 -D_LARGEFILE64_SOURCE -DNUMA_SUPPORT  -L/usr/lib -lcr  conftest.c -lcr   >&5

Randall Svancara
High Performance Computing Systems Administrator
Washington State University
509-335-3039

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
Sent: Wednesday, April 18, 2012 2:14 PM

To: Torque Users Mailing List
Subject: Re: [torqueusers] NUMA -- A first try

Randall,

After looking closer at your logs, it appears that the pbs_mom binary wasn't NUMA-enabled. If it were, you'd see a message saying:

Setting up this mom to function as %d numa nodes - in your case that %d would be a 2.

or you'd have one of these error messages:

Malformed mom.layout file, line:\n%s\n
Unable to read the layout file in %s
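A quick way to check whether a given pbs_mom binary has these strings compiled in is to scan its raw bytes, much as `strings pbs_mom | grep numa` would. This is an illustrative heuristic only, not an official Torque diagnostic, and the binary path in the comment is just an example:

```python
def binary_contains(path, needle):
    """Return True if the raw bytes of the file at `path` contain `needle`."""
    with open(path, "rb") as f:
        return needle.encode() in f.read()

# Hypothetical usage on a compute node: a NUMA-enabled mom should contain
# the "Setting up this mom to function as" log string quoted above.
# binary_contains("/usr/sbin/pbs_mom", "Setting up this mom to function as")
```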

David
On Wed, Apr 18, 2012 at 1:26 PM, Svancara, Randall <rsvancara at wsu.edu> wrote:
Hey, good to know that I did something correctly. I have verified that the mom.layout file is in /var/spool/torque/mom_priv/mom.layout:

  4 -rw-r--r-- 1 root root    185 Apr 17 19:44 config
  4 -rwxr-xr-x 1 root root    708 Apr  5  2011 epilogue
  4 -rwxrwxrwx 1 root root    708 Apr  5  2011 epilogue.sh
  0 drwxr-x--x 2 root root     40 Apr 17 10:33 jobs
  4 -rwxr--r-- 1 root root     31 Apr 17 19:23 mom.layout
  4 -rwxr--r-- 1 root root     50 Apr 17 19:20 mom.layout_bak
  4 -rw-r--r-- 1 root root     32 Apr 17 15:26 mom.layout_old
  4 -rw-r--r-- 1 root root      7 Apr 17 19:45 mom.lock
  4 -rwxr-xr-x 1 root root    527 Apr 26  2011 prologue
  4 -rwxrwxrwx 1 root root    527 Apr  5  2011 prologue.sh
  4 -rwxr-xr-x 1 root root    203 Apr  5  2011 setperms.sh

Thanks,

Randall Svancara
High Performance Computing Systems Administrator
Washington State University
509-335-3039

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
Sent: Wednesday, April 18, 2012 12:16 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] NUMA -- A first try

Randall,

You did compile NUMA support; you can tell because you get node11-0 and node11-1 in your pbsnodes output. Is your mom.layout file in the correct location? It should be in mom_priv/mom.layout.

David

On Wed, Apr 18, 2012 at 9:44 AM, Svancara, Randall <rsvancara at wsu.edu> wrote:
Hi,

I have compiled torque 3.0.4 with NUMA support per this document.

http://www.clusterresources.com/torquedocs21/1.7torqueonnuma.shtml

I have created the server_priv/nodes and mom_priv/mom.layout files.

Here are the versions of software:

[root at node11 bin]# pbs_mom -v
version: 3.0.4

[root at mgt1 server_priv]# pbs_server -v
version: 3.0.4

lstopo shows:

[root at node11 bin]# ./lstopo
Machine (24GB)
  NUMANode L#0 (P#0 12GB) + Socket L#0 + L3 L#0 (12MB)
    L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
    L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#1)
    L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#2)
    L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#3)
    L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#4)
    L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#5)
  NUMANode L#1 (P#1 12GB) + Socket L#1 + L3 L#1 (12MB)
    L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#6)
    L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7)
    L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8 + PU L#8 (P#8)
    L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9 + PU L#9 (P#9)
    L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10 + PU L#10 (P#10)
    L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11 + PU L#11 (P#11)

mom.layout:

cpus=0-5        mem=0
cpus=6-11       mem=1

server_priv/nodes:
node11 num_numa_nodes=2 compute
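For what it's worth, the expected mom.layout syntax can be sanity-checked with a few lines of script. This is only an illustrative sketch assuming the exact `cpus=A-B mem=N` form shown above; Torque's real parser lives in pbs_mom and may accept more than this:

```python
import re

# One NUMA-node spec per line, e.g. "cpus=0-5  mem=0" as in the file above.
LINE_RE = re.compile(r"^cpus=(\d+)-(\d+)\s+mem=(\d+)$")

def parse_mom_layout(text):
    """Parse mom.layout text into (cpu_list, mem_node) pairs, raising on bad lines."""
    nodes = []
    for lineno, raw in enumerate(text.splitlines(), 1):
        line = raw.strip()
        if not line:
            continue
        m = LINE_RE.match(line)
        if m is None:
            raise ValueError("Malformed mom.layout line %d: %r" % (lineno, raw))
        lo, hi, mem = (int(g) for g in m.groups())
        nodes.append((list(range(lo, hi + 1)), mem))
    return nodes

layout = parse_mom_layout("cpus=0-5\tmem=0\ncpus=6-11\tmem=1\n")
```

Run against the two-line file above, this yields two entries, one per NUMA node.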

I restarted pbs_server on the management node and pbs_mom on node11.
pbsnodes -a shows:

node11-0
     state = down
     np = 0
     properties = compute
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

node11-1
     state = down
     np = 0
     properties = compute
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0


mom_log on node11 has:

04/17/2012 19:45:01;0002;   pbs_mom;Svr;Log;Log opened
04/17/2012 19:45:01;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.4, loglevel = 0
04/17/2012 19:45:01;0002;   pbs_mom;Svr;setpbsserver;mgt1
04/17/2012 19:45:01;0002;   pbs_mom;Svr;mom_server_add;server mgt1 added
04/17/2012 19:45:01;0002;   pbs_mom;Svr;setremchkptdirlist;added RemChkptDir[0] '/fastscratch/tmp'
04/17/2012 19:45:01;0002;   pbs_mom;Svr;settmpdir;/fastscratch/tmp
04/17/2012 19:45:01;0002;   pbs_mom;Svr;setloglevel;7
04/17/2012 19:45:01;0002;   pbs_mom;Svr;read_config;processing config line '$usecp *:/home /home
04/17/2012 19:45:01;0002;   pbs_mom;Svr;usecp;*:/home /home
04/17/2012 19:45:01;0002;   pbs_mom;Svr;read_config;processing config line '$usecp *:/scratch /scratch
04/17/2012 19:45:01;0002;   pbs_mom;Svr;usecp;*:/scratch /scratch
04/17/2012 19:45:01;0002;   pbs_mom;Svr;read_config;processing config line '$spool_as_final_name true
04/17/2012 19:45:01;0002;   pbs_mom;Svr;spoolasfinalname;true
04/17/2012 19:45:01;0002;   pbs_mom;n/a;initialize;independent
04/17/2012 19:45:01;0002;   pbs_mom;n/a;mom_open_poll;started
04/17/2012 19:45:01;0080;   pbs_mom;Svr;mom_get_sample;proc_array load started
04/17/2012 19:45:01;0080;   pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=202
04/17/2012 19:45:01;0080;   pbs_mom;Svr;pbs_mom;before init_abort_jobs
04/17/2012 19:45:01;0001;   pbs_mom;Svr;pbs_mom;init_abort_jobs: recover=2
04/17/2012 19:45:01;0002;   pbs_mom;Svr;pbs_mom;Is up
04/17/2012 19:45:01;0002;   pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /usr/sbin/pbs_mom 1334684029
04/17/2012 19:45:01;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.4, loglevel = 7
04/17/2012 19:45:01;0002;   pbs_mom;Svr;pbs_mom;checking for old pbs_mom logs in dir '/var/spool/torque/mom_logs' (older than 1 days)
04/17/2012 19:45:01;0002;   pbs_mom;n/a;mom_server_open_stream;mom_server_open_stream: trying to open RPP conn to mgt1 port 15001
04/17/2012 19:45:01;0002;   pbs_mom;n/a;mom_server_open_stream;mom_server_open_stream: added connection to mgt1 port 15001
04/17/2012 19:45:01;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server mgt1
04/17/2012 19:45:01;0008;   pbs_mom;Svr;mom_server_send_hello;mom_server_send_hello
04/17/2012 19:45:01;0008;   pbs_mom;Svr;mom_server_send_hello;mom_server_send_hello done. Sent count = 1
04/17/2012 19:45:03;0008;   pbs_mom;Job;do_rpp;got an inter-server request
04/17/2012 19:45:03;0001;   pbs_mom;Job;is_request;stream 0 version 2
04/17/2012 19:45:03;0001;   pbs_mom;Job;is_request;command 2, "CLUSTER_ADDRS", received

My problem, as illustrated by the pbsnodes output above, is that node11 is down. And running strace on the pbs_mom process does not show any access to the mom.layout file, either.

So did I really compile NUMA support? I can see references to NUMA in the Torque Makefile, and config.log definitely shows the right parameters:

  $ ./configure --prefix=/usr --with-blcr=/usr --enable-numa-support --disable-gui --enable-blcr --with-default-server=mgt1 --with-servchkptdir=/fastscratch/tmp --with-tmpdir=/fastscratch/tmp

Can anyone provide further illumination on my already dark, dreary day?

Thanks,

Randall Svancara
High Performance Computing Systems Administrator
Washington State University


_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers



--
David Beer | Software Engineer
Adaptive Computing




