[torqueusers] NUMA -- A first try
Svancara, Randall
rsvancara at wsu.edu
Wed Apr 18 13:26:55 MDT 2012
Good to know that I did something right. I have verified that the mom.layout file is at /var/spool/torque/mom_priv/mom.layout:
4 -rw-r--r-- 1 root root 185 Apr 17 19:44 config
4 -rwxr-xr-x 1 root root 708 Apr 5 2011 epilogue
4 -rwxrwxrwx 1 root root 708 Apr 5 2011 epilogue.sh
0 drwxr-x--x 2 root root 40 Apr 17 10:33 jobs
4 -rwxr--r-- 1 root root 31 Apr 17 19:23 mom.layout
4 -rwxr--r-- 1 root root 50 Apr 17 19:20 mom.layout_bak
4 -rw-r--r-- 1 root root 32 Apr 17 15:26 mom.layout_old
4 -rw-r--r-- 1 root root 7 Apr 17 19:45 mom.lock
4 -rwxr-xr-x 1 root root 527 Apr 26 2011 prologue
4 -rwxrwxrwx 1 root root 527 Apr 5 2011 prologue.sh
4 -rwxr-xr-x 1 root root 203 Apr 5 2011 setperms.sh
Thanks,
Randall Svancara
High Performance Computing Systems Administrator
Washington State University
509-335-3039
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of David Beer
Sent: Wednesday, April 18, 2012 12:16 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] NUMA -- A first try
Randall,
You did compile NUMA support. You can tell because node11-0 and node11-1 appear in your pbsnodes output. Is your mom.layout file in the correct location? It should be at mom_priv/mom.layout.
David
On Wed, Apr 18, 2012 at 9:44 AM, Svancara, Randall <rsvancara at wsu.edu> wrote:
Hi,
I have compiled TORQUE 3.0.4 with NUMA support, following this document:
http://www.clusterresources.com/torquedocs21/1.7torqueonnuma.shtml
I have created the server_priv/nodes and mom_priv/mom.layout files.
Here are the versions of software:
[root@node11 bin]# pbs_mom -v
version: 3.0.4
[root@mgt1 server_priv]# pbs_server -v
version: 3.0.4
lstopo shows:
[root@node11 bin]# ./lstopo
Machine (24GB)
  NUMANode L#0 (P#0 12GB) + Socket L#0 + L3 L#0 (12MB)
    L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
    L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#1)
    L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#2)
    L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#3)
    L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#4)
    L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#5)
  NUMANode L#1 (P#1 12GB) + Socket L#1 + L3 L#1 (12MB)
    L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#6)
    L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7)
    L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8 + PU L#8 (P#8)
    L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9 + PU L#9 (P#9)
    L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10 + PU L#10 (P#10)
    L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11 + PU L#11 (P#11)
mom.layout:
cpus=0-5 mem=0
cpus=6-11 mem=1
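For anyone who wants to cross-check a layout file against the topology, here is a rough Python sketch that derives "cpus=... mem=..." lines from lstopo text output like the one above. The function names are made up for illustration, and it assumes that the P# index of each NUMANode is the value to use for mem=, which matches the two lines above but is not confirmed anywhere in this thread. Whether mom.layout accepts comma-separated ranges for non-contiguous CPUs is also an assumption.

```python
import re

# Patterns for the lstopo text format shown in this post.
NUMA_RE = re.compile(r"NUMANode L#\d+ \(P#(\d+)")
PU_RE = re.compile(r"PU L#\d+ \(P#(\d+)\)")


def compress(ids):
    """Collapse integers into a range string, e.g. [0, 1, 2, 4] -> '0-2,4'."""
    ids = sorted(ids)
    parts = []
    start = prev = ids[0]
    for n in ids[1:]:
        if n == prev + 1:
            prev = n
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = n
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


def layout_from_lstopo(text):
    """Return one 'cpus=... mem=...' line per NUMA node found in the text."""
    nodes = []  # list of (numa_os_index, [cpu_os_index, ...])
    for line in text.splitlines():
        m = NUMA_RE.search(line)
        if m:
            nodes.append((int(m.group(1)), []))
        if nodes:
            # Attribute every PU to the most recently seen NUMA node.
            nodes[-1][1].extend(int(p) for p in PU_RE.findall(line))
    return [f"cpus={compress(cpus)} mem={mem}" for mem, cpus in nodes if cpus]


if __name__ == "__main__":
    import sys
    # Example: ./lstopo | python layout_from_lstopo.py
    print("\n".join(layout_from_lstopo(sys.stdin.read())))
```

The output can then be compared by eye with the existing mom.layout before replacing anything.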
server_priv/nodes:
node11 num_numa_nodes=2 compute
I restarted pbs_server on the management node and pbs_mom on node11.
pbsnodes -a shows:
node11-0
     state = down
     np = 0
     properties = compute
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

node11-1
     state = down
     np = 0
     properties = compute
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0
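If it helps with checking several nodes at once, the pbsnodes output above can be parsed with a short Python sketch like this one. The function names are hypothetical, and it assumes the "name line followed by key = value lines" shape shown above.

```python
def parse_pbsnodes(text):
    """Parse 'pbsnodes -a' style text into {node_name: {attribute: value}}."""
    nodes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if " = " in line and current is not None:
            # Attribute line belonging to the most recent node name.
            key, value = line.split(" = ", 1)
            nodes[current][key.strip()] = value.strip()
        else:
            # Any other non-empty line is treated as a node name.
            current = line
            nodes[current] = {}
    return nodes


def down_nodes(nodes):
    """Return names of nodes whose state list includes 'down'."""
    return [
        name
        for name, attrs in nodes.items()
        if "down" in attrs.get("state", "").split(",")
    ]


if __name__ == "__main__":
    import subprocess
    # Assumes pbsnodes is on PATH on the machine where this runs.
    out = subprocess.run(
        ["pbsnodes", "-a"], capture_output=True, text=True, check=True
    ).stdout
    for name in down_nodes(parse_pbsnodes(out)):
        print(name)
```

On the output above this would list both node11-0 and node11-1, each with np = 0.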
mom_log on node11 has:
04/17/2012 19:45:01;0002; pbs_mom;Svr;Log;Log opened
04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.4, loglevel = 0
04/17/2012 19:45:01;0002; pbs_mom;Svr;setpbsserver;mgt1
04/17/2012 19:45:01;0002; pbs_mom;Svr;mom_server_add;server mgt1 added
04/17/2012 19:45:01;0002; pbs_mom;Svr;setremchkptdirlist;added RemChkptDir[0] '/fastscratch/tmp'
04/17/2012 19:45:01;0002; pbs_mom;Svr;settmpdir;/fastscratch/tmp
04/17/2012 19:45:01;0002; pbs_mom;Svr;setloglevel;7
04/17/2012 19:45:01;0002; pbs_mom;Svr;read_config;processing config line '$usecp *:/home /home
04/17/2012 19:45:01;0002; pbs_mom;Svr;usecp;*:/home /home
04/17/2012 19:45:01;0002; pbs_mom;Svr;read_config;processing config line '$usecp *:/scratch /scratch
04/17/2012 19:45:01;0002; pbs_mom;Svr;usecp;*:/scratch /scratch
04/17/2012 19:45:01;0002; pbs_mom;Svr;read_config;processing config line '$spool_as_final_name true
04/17/2012 19:45:01;0002; pbs_mom;Svr;spoolasfinalname;true
04/17/2012 19:45:01;0002; pbs_mom;n/a;initialize;independent
04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_open_poll;started
04/17/2012 19:45:01;0080; pbs_mom;Svr;mom_get_sample;proc_array load started
04/17/2012 19:45:01;0080; pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=202
04/17/2012 19:45:01;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs
04/17/2012 19:45:01;0001; pbs_mom;Svr;pbs_mom;init_abort_jobs: recover=2
04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;Is up
04/17/2012 19:45:01;0002; pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /usr/sbin/pbs_mom 1334684029
04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.4, loglevel = 7
04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;checking for old pbs_mom logs in dir '/var/spool/torque/mom_logs' (older than 1 days)
04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_server_open_stream;mom_server_open_stream: trying to open RPP conn to mgt1 port 15001
04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_server_open_stream;mom_server_open_stream: added connection to mgt1 port 15001
04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server mgt1
04/17/2012 19:45:01;0008; pbs_mom;Svr;mom_server_send_hello;mom_server_send_hello
04/17/2012 19:45:01;0008; pbs_mom;Svr;mom_server_send_hello;mom_server_send_hello done. Sent count = 1
04/17/2012 19:45:03;0008; pbs_mom;Job;do_rpp;got an inter-server request
04/17/2012 19:45:03;0001; pbs_mom;Job;is_request;stream 0 version 2
04/17/2012 19:45:03;0001; pbs_mom;Job;is_request;command 2, "CLUSTER_ADDRS", received
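A quick way to check whether the mom_log mentions the layout file at all is to split each line on semicolons and search the text. This is only a sketch: the six field names below are my own labels for what the lines above appear to contain, not names taken from the TORQUE documentation.

```python
from typing import NamedTuple


class LogEntry(NamedTuple):
    timestamp: str
    event: str      # kept as a string, e.g. '0002'
    daemon: str
    obj_class: str  # e.g. 'Svr', 'Job', 'n/a'
    obj_name: str
    message: str


def parse_line(line):
    """Split one mom_log line into six fields; return None if it does not fit."""
    parts = line.rstrip("\n").split(";", 5)  # message may itself contain ';'
    if len(parts) != 6:
        return None
    ts, event, daemon, obj_class, obj_name, message = parts
    return LogEntry(
        ts.strip(), event.strip(), daemon.strip(),
        obj_class.strip(), obj_name.strip(), message,
    )


def mentions(lines, *needles):
    """Return entries whose object name or message contains any needle."""
    hits = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        haystack = (entry.obj_name + " " + entry.message).lower()
        if any(n.lower() in haystack for n in needles):
            hits.append(entry)
    return hits


if __name__ == "__main__":
    import sys
    # Example: python scan_mom_log.py < /path/to/mom_log_file
    for e in mentions(sys.stdin, "layout", "numa"):
        print(e.timestamp, e.obj_name, e.message)
```

Run against the log lines above, a search for "layout" or "numa" returns nothing, which is consistent with the strace observation below.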
My problem, as the pbsnodes output above shows, is that node11 is down. Also, running strace on the pbs_mom process does not show any access to the mom.layout file.
So did I really compile NUMA support? I can see references to NUMA in the Makefile for torque and the config.log definitely has the right parameters:
$ ./configure --prefix=/usr --with-blcr=/usr --enable-numa-support --disable-gui --enable-blcr --with-default-server=mgt1 --with-servchkptdir=/fastscratch/tmp --with-tmpdir=/fastscratch/tmp
Can anyone provide further illumination on my already dark dreary day?
Thanks,
Randall Svancara
High Performance Computing Systems Administrator
Washington State University
--
David Beer | Software Engineer
Adaptive Computing