[torqueusers] NUMA -- A first try
Svancara, Randall
rsvancara at wsu.edu
Wed Apr 18 09:44:33 MDT 2012
Hi,
I have compiled TORQUE 3.0.4 with NUMA support, following this document:
http://www.clusterresources.com/torquedocs21/1.7torqueonnuma.shtml
I have created the server_priv/nodes and mom_priv/mom.layout files.
Here are the versions of software:
[root@node11 bin]# pbs_mom -v
version: 3.0.4
[root@mgt1 server_priv]# pbs_server -v
version: 3.0.4
lstopo shows:
[root@node11 bin]# ./lstopo
Machine (24GB)
  NUMANode L#0 (P#0 12GB) + Socket L#0 + L3 L#0 (12MB)
    L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
    L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#1)
    L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#2)
    L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#3)
    L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#4)
    L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#5)
  NUMANode L#1 (P#1 12GB) + Socket L#1 + L3 L#1 (12MB)
    L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#6)
    L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7)
    L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8 + PU L#8 (P#8)
    L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9 + PU L#9 (P#9)
    L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10 + PU L#10 (P#10)
    L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11 + PU L#11 (P#11)
mom_priv/mom.layout:
cpus=0-5 mem=0
cpus=6-11 mem=1
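
As a cross-check that the two layout lines above agree with the lstopo output, here is a minimal Python sketch that derives one "cpus=... mem=..." line per NUMA node from lstopo's text. The function names (cpu_range, layout_from_lstopo) are my own and not part of TORQUE or hwloc. It only handles the contiguous CPU ranges seen in this post, and the single-CPU form it would emit is an assumption.

```python
import re

# The lstopo text from this post (indentation does not matter to the parser).
SAMPLE = """\
Machine (24GB)
NUMANode L#0 (P#0 12GB) + Socket L#0 + L3 L#0 (12MB)
L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#1)
L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#2)
L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#3)
L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#4)
L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#5)
NUMANode L#1 (P#1 12GB) + Socket L#1 + L3 L#1 (12MB)
L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#6)
L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7)
L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8 + PU L#8 (P#8)
L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9 + PU L#9 (P#9)
L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10 + PU L#10 (P#10)
L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11 + PU L#11 (P#11)
"""


def cpu_range(pus):
    """Format a sorted list of CPU indexes as 'lo-hi'; refuse sets with gaps."""
    lo, hi = pus[0], pus[-1]
    if pus != list(range(lo, hi + 1)):
        raise ValueError("non-contiguous CPU set: %r" % (pus,))
    # Assumption: a node with a single CPU is written as a bare number.
    return str(lo) if lo == hi else "%d-%d" % (lo, hi)


def layout_from_lstopo(text):
    """Return one 'cpus=... mem=...' line per NUMANode found in lstopo text."""
    nodes = []  # list of (numa_physical_index, [pu_physical_indexes])
    for line in text.splitlines():
        numa = re.search(r"NUMANode L#\d+ \(P#(\d+)", line)
        if numa:
            nodes.append((int(numa.group(1)), []))
        pu = re.search(r"PU L#\d+ \(P#(\d+)\)", line)
        if pu and nodes:
            # Attach each PU to the most recently seen NUMA node.
            nodes[-1][1].append(int(pu.group(1)))
    return ["cpus=%s mem=%d" % (cpu_range(sorted(pus)), idx)
            for idx, pus in nodes if pus]


print("\n".join(layout_from_lstopo(SAMPLE)))
# cpus=0-5 mem=0
# cpus=6-11 mem=1
```

The output matches the mom.layout shown above, so the layout itself looks consistent with the hardware topology.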
server_priv/nodes:
node11 num_numa_nodes=2 compute
I restarted pbs_server on the management node and pbs_mom on node11.
pbsnodes -a shows:
node11-0
     state = down
     np = 0
     properties = compute
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

node11-1
     state = down
     np = 0
     properties = compute
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0
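
For checking many nodes at once, the pbsnodes output above can be parsed into a dictionary. This is a rough sketch only: parse_pbsnodes is a hypothetical helper of my own, and it assumes the plain "key = value" text format shown in this post, where a line without " = " starts a new node.

```python
# Trimmed copy of the pbsnodes -a output from this post.
SAMPLE = """\
node11-0
     state = down
     np = 0
     properties = compute

node11-1
     state = down
     np = 0
     properties = compute
"""


def parse_pbsnodes(text):
    """Map node name -> {attribute: value} from 'pbsnodes -a' style text."""
    nodes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if " = " in line:
            if current is None:
                continue  # attribute before any node name; ignore it
            key, value = line.split(" = ", 1)
            nodes[current][key] = value
        else:
            # A line without " = " is taken to be a node name.
            current = line
            nodes[current] = {}
    return nodes


nodes = parse_pbsnodes(SAMPLE)
down = sorted(n for n, attrs in nodes.items() if attrs.get("state") == "down")
print(down)  # ['node11-0', 'node11-1']
```

Both NUMA sub-nodes report state = down and np = 0, which is the symptom described in this post.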
mom_log on node11 has:
04/17/2012 19:45:01;0002; pbs_mom;Svr;Log;Log opened
04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.4, loglevel = 0
04/17/2012 19:45:01;0002; pbs_mom;Svr;setpbsserver;mgt1
04/17/2012 19:45:01;0002; pbs_mom;Svr;mom_server_add;server mgt1 added
04/17/2012 19:45:01;0002; pbs_mom;Svr;setremchkptdirlist;added RemChkptDir[0] '/fastscratch/tmp'
04/17/2012 19:45:01;0002; pbs_mom;Svr;settmpdir;/fastscratch/tmp
04/17/2012 19:45:01;0002; pbs_mom;Svr;setloglevel;7
04/17/2012 19:45:01;0002; pbs_mom;Svr;read_config;processing config line '$usecp *:/home /home
04/17/2012 19:45:01;0002; pbs_mom;Svr;usecp;*:/home /home
04/17/2012 19:45:01;0002; pbs_mom;Svr;read_config;processing config line '$usecp *:/scratch /scratch
04/17/2012 19:45:01;0002; pbs_mom;Svr;usecp;*:/scratch /scratch
04/17/2012 19:45:01;0002; pbs_mom;Svr;read_config;processing config line '$spool_as_final_name true
04/17/2012 19:45:01;0002; pbs_mom;Svr;spoolasfinalname;true
04/17/2012 19:45:01;0002; pbs_mom;n/a;initialize;independent
04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_open_poll;started
04/17/2012 19:45:01;0080; pbs_mom;Svr;mom_get_sample;proc_array load started
04/17/2012 19:45:01;0080; pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=202
04/17/2012 19:45:01;0080; pbs_mom;Svr;pbs_mom;before init_abort_jobs
04/17/2012 19:45:01;0001; pbs_mom;Svr;pbs_mom;init_abort_jobs: recover=2
04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;Is up
04/17/2012 19:45:01;0002; pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /usr/sbin/pbs_mom 1334684029
04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.4, loglevel = 7
04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;checking for old pbs_mom logs in dir '/var/spool/torque/mom_logs' (older than 1 days)
04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_server_open_stream;mom_server_open_stream: trying to open RPP conn to mgt1 port 15001
04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_server_open_stream;mom_server_open_stream: added connection to mgt1 port 15001
04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server mgt1
04/17/2012 19:45:01;0008; pbs_mom;Svr;mom_server_send_hello;mom_server_send_hello
04/17/2012 19:45:01;0008; pbs_mom;Svr;mom_server_send_hello;mom_server_send_hello done. Sent count = 1
04/17/2012 19:45:03;0008; pbs_mom;Job;do_rpp;got an inter-server request
04/17/2012 19:45:03;0001; pbs_mom;Job;is_request;stream 0 version 2
04/17/2012 19:45:03;0001; pbs_mom;Job;is_request;command 2, "CLUSTER_ADDRS", received
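
The log lines above are semicolon-delimited, which makes it easy to search them for anything mentioning NUMA or mom.layout. The sketch below assumes the six-field layout visible in this excerpt. The field names (time, event, daemon, obj_class, name, message) are labels I chose, not official TORQUE terms.

```python
from datetime import datetime

# Three lines copied from the mom_log excerpt in this post.
SAMPLE = """\
04/17/2012 19:45:01;0002; pbs_mom;Svr;pbs_mom;Is up
04/17/2012 19:45:01;0002; pbs_mom;n/a;mom_server_check_connection;sending hello to server mgt1
04/17/2012 19:45:03;0001; pbs_mom;Job;is_request;command 2, "CLUSTER_ADDRS", received
"""


def parse_mom_log_line(line):
    """Split one log line into a dict, or return None if it does not fit."""
    parts = line.rstrip("\n").split(";", 5)  # message may itself contain ';'
    if len(parts) != 6:
        return None
    stamp, event, daemon, obj_class, name, message = parts
    try:
        when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return {
        "time": when,
        "event": event,
        "daemon": daemon.strip(),
        "obj_class": obj_class,
        "name": name,
        "message": message,
    }


def mentions(entries, needle):
    """Return entries whose function name or message contains needle."""
    needle = needle.lower()
    return [e for e in entries
            if needle in e["message"].lower() or needle in e["name"].lower()]


entries = [e for e in map(parse_mom_log_line, SAMPLE.splitlines()) if e]
print(len(entries), len(mentions(entries, "numa")), len(mentions(entries, "layout")))
```

Run against the full excerpt, a search like this finds no entry that refers to NUMA or to the layout file, even at loglevel 7.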
My problem, as the pbsnodes output above shows, is that both node11-0 and node11-1 are down with np = 0. Running strace on the pbs_mom process also shows no access to the mom.layout file.
So did I really compile NUMA support? I can see references to NUMA in the Makefile for torque and the config.log definitely has the right parameters:
$ ./configure --prefix=/usr --with-blcr=/usr --enable-numa-support --disable-gui --enable-blcr --with-default-server=mgt1 --with-servchkptdir=/fastscratch/tmp --with-tmpdir=/fastscratch/tmp
Can anyone provide further illumination on my already dark dreary day?
Thanks,
Randall Svancara
High Performance Computing Systems Administrator
Washington State University