[torqueusers] NUMA -- A first try
Svancara, Randall
rsvancara at wsu.edu
Thu Apr 19 09:23:16 MDT 2012
No worries, I am just happy that I could figure out a problem and contribute to such a great application. If there is something I could do to help file bugs, let me know.
Thanks,
Randall Svancara
High Performance Computing Systems Administrator
Washington State University
509-335-3039
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
Sent: Thursday, April 19, 2012 8:16 AM
To: Torque Users Mailing List
Subject: Re: [torqueusers] NUMA -- A first try
Randall,
I apologize for the problem. We will log a bug internally and make sure this is corrected.
David
On Wed, Apr 18, 2012 at 5:29 PM, Svancara, Randall <rsvancara at wsu.edu> wrote:
Just a heads up: the reason NUMA was not being built into the RPM is that the buildutils/torque.spec.in file does not include the %{ac_with_numa} parameter in the ./configure section. A regular (non-RPM) build works fine.
187c184
< --with-server-home=%{torque_home} --with-sendmail=%{sendmail_path} %{ac_with_numa} \
---
> --with-server-home=%{torque_home} --with-sendmail=%{sendmail_path} \
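A quick way to check a given copy of the spec file for this problem is sketched below. It is only an illustration: the function name is hypothetical, and it assumes the configure arguments sit on the same line as `--with-server-home=`, as they do in the diff above.

```python
def configure_line_has_macro(spec_text,
                             macro="%{ac_with_numa}",
                             marker="--with-server-home="):
    """Return True if the first non-comment line containing `marker`
    also contains `macro`.

    Assumption: the spec keeps these arguments on one line, as in the
    diff quoted in this thread. Other layouts are not handled.
    """
    for line in spec_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip commented-out lines
        if marker in line:
            return macro in line
    return False


# The two variants of the line shown in the diff.
with_macro = ("--with-server-home=%{torque_home} "
              "--with-sendmail=%{sendmail_path} %{ac_with_numa} \\")
without_macro = ("--with-server-home=%{torque_home} "
                 "--with-sendmail=%{sendmail_path} \\")

print(configure_line_has_macro(with_macro))
print(configure_line_has_macro(without_macro))
```

To use it on a real tree, read buildutils/torque.spec.in into a string and pass that string in.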
I could not figure it out exactly, as I am still learning about spec files, but these lines may also need to change:
19c19
< #%bcond_with blcr
---
> %bcond_with blcr
25c25
< #%bcond_with numa
---
> %bcond_with numa
35,37d34
< %bcond_without bclr
< %bcond_without numa
<
I am not sure if the ./configure overrides these values or not.
Randall Svancara
High Performance Computing Systems Administrator
Washington State University
509-335-3039
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
Sent: Wednesday, April 18, 2012 2:51 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] NUMA -- A first try
There are no additional libraries that you need to supply.
On Wed, Apr 18, 2012 at 3:46 PM, Svancara, Randall <rsvancara at wsu.edu> wrote:
Ok, well this gives me a starting place at least.
Are there additional libraries I need to supply?
I build the software the following way:
./configure --prefix=/usr --with-blcr=/usr --enable-numa-support --disable-gui --enable-blcr --with-default-server=mgt1 --with-servchkptdir=/fastscratch/tmp --with-tmpdir=/fastscratch/tmp
make rpm
I believe this is the relevant section out of the config.log.
configure:22026: $? = 0
configure:22029: test -s conftest.o
configure:22032: $? = 0
configure:22043: result: yes
configure:22257: checking whether to allow geometry requests
configure:22274: result: no
configure:22285: checking whether to support NUMA systems
configure:22288: result: yes
configure:22313: checking whether to enable libcpuset support
configure:22399: result: no
configure:22407: checking whether to enable memacct support
configure:22416: result: no
configure:22510: checking whether add memory alignment flags
configure:22517: result: no
configure:22604: checking whether to build BLCR support
configure:22606: result: yes
configure:22627: checking for cr_init in -lcr
configure:22657: gcc -o conftest -g -O2 -D_LARGEFILE64_SOURCE -DNUMA_SUPPORT -L/usr/lib -lcr conftest.c -lcr >&5
Randall Svancara
High Performance Computing Systems Administrator
Washington State University
509-335-3039
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
Sent: Wednesday, April 18, 2012 2:14 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] NUMA -- A first try
Randall,
After looking closer at your logs, it appears that the pbs_mom binary wasn't numa enabled. If it were, you'd have a message saying:
Setting up this mom to function as %d numa nodes - in your case that %d would be a 2.
or you'd have one of these error messages:
Malformed mom.layout file, line:\n%s\n
Unable to read the layout file in %s
David
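The check David describes can be sketched as a small log scan. This is a rough illustration only: the function name and return values are hypothetical, and the message fragments are copied from the message above, with `%d` replaced by a pattern that matches a number. Finding none of the messages only suggests, and does not prove, that the binary lacks NUMA support.

```python
import re

# Message fragments quoted in this thread. The %d placeholder becomes (\d+).
NUMA_OK = re.compile(r"Setting up this mom to function as (\d+) numa nodes")
NUMA_ERRORS = (
    "Malformed mom.layout file",
    "Unable to read the layout file",
)


def numa_status(log_lines):
    """Classify mom_log lines by the first NUMA-related message found.

    Returns ("enabled", node_count), ("layout-error", offending_line),
    or ("not-found", None) when none of the messages appear.
    """
    for line in log_lines:
        match = NUMA_OK.search(line)
        if match:
            return ("enabled", int(match.group(1)))
        if any(err in line for err in NUMA_ERRORS):
            return ("layout-error", line.strip())
    return ("not-found", None)


# A line taken from the mom_log quoted later in this thread.
sample = ["04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;Is up"]
print(numa_status(sample))
```

In practice the lines would come from the current file under mom_logs rather than from a hard-coded list.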
On Wed, Apr 18, 2012 at 1:26 PM, Svancara, Randall <rsvancara at wsu.edu> wrote:
Hey, good to know that I did do something correctly. I have verified that the mom.layout file is at /var/spool/torque/mom_priv/mom.layout. Here is the listing of mom_priv:
4 -rw-r--r-- 1 root root 185 Apr 17 19:44 config
4 -rwxr-xr-x 1 root root 708 Apr 5 2011 epilogue
4 -rwxrwxrwx 1 root root 708 Apr 5 2011 epilogue.sh
0 drwxr-x--x 2 root root 40 Apr 17 10:33 jobs
4 -rwxr--r-- 1 root root 31 Apr 17 19:23 mom.layout
4 -rwxr--r-- 1 root root 50 Apr 17 19:20 mom.layout_bak
4 -rw-r--r-- 1 root root 32 Apr 17 15:26 mom.layout_old
4 -rw-r--r-- 1 root root 7 Apr 17 19:45 mom.lock
4 -rwxr-xr-x 1 root root 527 Apr 26 2011 prologue
4 -rwxrwxrwx 1 root root 527 Apr 5 2011 prologue.sh
4 -rwxr-xr-x 1 root root 203 Apr 5 2011 setperms.sh
Thanks,
Randall Svancara
High Performance Computing Systems Administrator
Washington State University
509-335-3039
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
Sent: Wednesday, April 18, 2012 12:16 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] NUMA -- A first try
Randall,
You did compile NUMA support. You can tell because you get node11-0 and node11-1 in your pbsnodes output. Is your mom.layout file in the correct location? It should be in mom_priv/mom.layout.
David
On Wed, Apr 18, 2012 at 9:44 AM, Svancara, Randall <rsvancara at wsu.edu> wrote:
Hi,
I have compiled torque 3.0.4 with NUMA support per this document.
http://www.clusterresources.com/torquedocs21/1.7torqueonnuma.shtml
I have created the server_priv/nodes and mom_priv/mom.layout files.
Here are the versions of software:
[root@node11 bin]# pbs_mom -v
version: 3.0.4
[root@mgt1 server_priv]# pbs_server -v
version: 3.0.4
lstopo shows:
[root@node11 bin]# ./lstopo
Machine (24GB)
NUMANode L#0 (P#0 12GB) + Socket L#0 + L3 L#0 (12MB)
L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#1)
L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#2)
L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#3)
L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#4)
L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#5)
NUMANode L#1 (P#1 12GB) + Socket L#1 + L3 L#1 (12MB)
L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#6)
L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7)
L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8 + PU L#8 (P#8)
L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9 + PU L#9 (P#9)
L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10 + PU L#10 (P#10)
L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11 + PU L#11 (P#11)
mom.layout:
cpus=0-5 mem=0
cpus=6-11 mem=1
server_priv/nodes:
node11 num_numa_nodes=2 compute
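As a sanity check on the two files above, the layout can be compared against num_numa_nodes and the CPU count reported by lstopo. The sketch below is an illustration under assumptions: the function names are hypothetical, and only the `cpus=A-B mem=N` form shown above (or a single CPU index) is handled, which may not cover everything pbs_mom accepts.

```python
def parse_cpu_range(spec):
    """Expand 'A-B' (or a single index 'A') into a list of CPU indices."""
    if "-" in spec:
        low, high = spec.split("-")
        return list(range(int(low), int(high) + 1))
    return [int(spec)]


def parse_layout(text):
    """Parse mom.layout text into one dict per non-empty line."""
    nodes = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = dict(item.split("=", 1) for item in line.split())
        nodes.append({"cpus": parse_cpu_range(fields["cpus"]),
                      "mem": fields["mem"]})
    return nodes


def check_layout(nodes, expected_nodes, total_cpus):
    """Return a list of problems found; an empty list means none."""
    problems = []
    if len(nodes) != expected_nodes:
        problems.append("node count does not match num_numa_nodes")
    cpus = [cpu for node in nodes for cpu in node["cpus"]]
    if len(cpus) != len(set(cpus)):
        problems.append("a CPU is assigned to more than one node")
    if sorted(set(cpus)) != list(range(total_cpus)):
        problems.append("layout does not cover CPUs 0..total_cpus-1")
    return problems


# Values from this message: two layout lines, num_numa_nodes=2, PUs 0-11.
layout = parse_layout("cpus=0-5 mem=0\ncpus=6-11 mem=1\n")
print(check_layout(layout, expected_nodes=2, total_cpus=12))
```

An empty result here only says the two files agree with each other and with lstopo. It says nothing about whether pbs_mom actually reads the layout file.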
I restarted pbs_server on the management node and pbs_mom on node11.
pbsnodes -a shows:
node11-0
state = down
np = 0
properties = compute
ntype = cluster
mom_service_port = 15002
mom_manager_port = 15003
gpus = 0
node11-1
state = down
np = 0
properties = compute
ntype = cluster
mom_service_port = 15002
mom_manager_port = 15003
gpus = 0
mom_log on node11 has:
04/17/2012 19:45:01;0002; pbs_mom;Svr;Log;Log opened
04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.4, loglevel = 0
04/17/2012 19:45:01;0002; pbs_mom;Svr;setpbsserver;mgt1
04/17/2012 19:45:01;0002; pbs_mom;Svr;mom_server_add;server mgt1 added
04/17/2012 19:45:01;0002; pbs_mom;Svr;setremchkptdirlist;added RemChkptDir[0] '/fastscratch/tmp'
04/17/2012 19:45:01;0002; pbs_mom;Svr;settmpdir;/fastscratch/tmp
04/17/2012 19:45:01;0002; pbs_mom;Svr;setloglevel;7
04/17/2012 19:45:01;0002; pbs_mom;Svr;read_config;processing config line '$usecp *:/home /home
04/17/2012 19:45:01;0002; pbs_mom;Svr;usecp;*:/home /home
04/17/2012 19:45:01;0002; pbs_mom;Svr;read_config;processing config line '$usecp *:/scratch /scratch
04/17/2012 19:45:01;0002; pbs_mom;Svr;usecp;*:/scratch /scratch
04/17/2012 19:45:01;0002; pbs_mom;Svr;read_config;processing config line '$spool_as_final_name true
04/17/2012 19:45:01;0002; pbs_mom;Svr;spoolasfinalname;true
04/17/2012 19:45:01;0002; pbs_mom;n/a;initialize;independent
04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_open_poll;started
04/17/2012 19:45:01;0080; pbs_mom;Svr;mom_get_sample;proc_array load started
04/17/2012 19:45:01;0080; pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=202
04/17/2012 19:45:01;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs
04/17/2012 19:45:01;0001; pbs_mom;Svr;pbs_mom;init_abort_jobs: recover=2
04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;Is up
04/17/2012 19:45:01;0002; pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /usr/sbin/pbs_mom 1334684029
04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.4, loglevel = 7
04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;checking for old pbs_mom logs in dir '/var/spool/torque/mom_logs' (older than 1 days)
04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_server_open_stream;mom_server_open_stream: trying to open RPP conn to mgt1 port 15001
04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_server_open_stream;mom_server_open_stream: added connection to mgt1 port 15001
04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server mgt1
04/17/2012 19:45:01;0008; pbs_mom;Svr;mom_server_send_hello;mom_server_send_hello
04/17/2012 19:45:01;0008; pbs_mom;Svr;mom_server_send_hello;mom_server_send_hello done. Sent count = 1
04/17/2012 19:45:03;0008; pbs_mom;Job;do_rpp;got an inter-server request
04/17/2012 19:45:03;0001; pbs_mom;Job;is_request;stream 0 version 2
04/17/2012 19:45:03;0001; pbs_mom;Job;is_request;command 2, "CLUSTER_ADDRS", received
My problem, as the pbsnodes output above illustrates, is that node11 is down. Also, running strace on the pbs_mom process does not show any access to the mom.layout file.
So did I really compile NUMA support? I can see references to NUMA in the Makefile for torque and the config.log definitely has the right parameters:
$ ./configure --prefix=/usr --with-blcr=/usr --enable-numa-support --disable-gui --enable-blcr --with-default-server=mgt1 --with-servchkptdir=/fastscratch/tmp --with-tmpdir=/fastscratch/tmp
Can anyone provide further illumination on my already dark dreary day?
Thanks,
Randall Svancara
High Performance Computing Systems Administrator
Washington State University
_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
--
David Beer | Software Engineer
Adaptive Computing