[torqueusers] NUMA -- A first try

Svancara, Randall rsvancara at wsu.edu
Wed Apr 18 13:26:55 MDT 2012


Hey, good to know that I did something right.  I have verified that the mom.layout file is at /var/spool/torque/mom_priv/mom.layout.

  4 -rw-r--r-- 1 root root    185 Apr 17 19:44 config
  4 -rwxr-xr-x 1 root root    708 Apr  5  2011 epilogue
  4 -rwxrwxrwx 1 root root    708 Apr  5  2011 epilogue.sh
  0 drwxr-x--x 2 root root     40 Apr 17 10:33 jobs
  4 -rwxr--r-- 1 root root     31 Apr 17 19:23 mom.layout
  4 -rwxr--r-- 1 root root     50 Apr 17 19:20 mom.layout_bak
  4 -rw-r--r-- 1 root root     32 Apr 17 15:26 mom.layout_old
  4 -rw-r--r-- 1 root root      7 Apr 17 19:45 mom.lock
  4 -rwxr-xr-x 1 root root    527 Apr 26  2011 prologue
  4 -rwxrwxrwx 1 root root    527 Apr  5  2011 prologue.sh
  4 -rwxr-xr-x 1 root root    203 Apr  5  2011 setperms.sh

Thanks,

Randall Svancara
High Performance Computing Systems Administrator
Washington State University
509-335-3039

From: torqueusers-bounces at supercluster.org On Behalf Of David Beer
Sent: Wednesday, April 18, 2012 12:16 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] NUMA -- A first try

Randall,

You did compile NUMA support. You can tell because node11-0 and node11-1 appear in your pbsnodes output. Is your mom.layout file in the correct location? It should be at mom_priv/mom.layout.

David

On Wed, Apr 18, 2012 at 9:44 AM, Svancara, Randall <rsvancara at wsu.edu> wrote:
Hi,

I have compiled torque 3.0.4 with NUMA support per this document.

http://www.clusterresources.com/torquedocs21/1.7torqueonnuma.shtml

I have created the server_priv/nodes and mom_priv/mom.layout files.

Here are the versions of software:

[root@node11 bin]# pbs_mom -v
version: 3.0.4

[root@mgt1 server_priv]# pbs_server -v
version: 3.0.4

lstopo shows:

[root@node11 bin]# ./lstopo
Machine (24GB)
  NUMANode L#0 (P#0 12GB) + Socket L#0 + L3 L#0 (12MB)
    L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
    L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#1)
    L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#2)
    L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#3)
    L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#4)
    L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#5)
  NUMANode L#1 (P#1 12GB) + Socket L#1 + L3 L#1 (12MB)
    L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#6)
    L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7)
    L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8 + PU L#8 (P#8)
    L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9 + PU L#9 (P#9)
    L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10 + PU L#10 (P#10)
    L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11 + PU L#11 (P#11)

mom.layout:

cpus=0-5        mem=0
cpus=6-11       mem=1
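
A minimal sketch of how the two mom.layout lines above could be derived from the lstopo output, so the layout can be checked against the hardware. The function names (numa_layout, cpu_range, mom_layout_lines) are hypothetical helpers, not part of TORQUE or hwloc, and the parsing assumes the exact lstopo text format shown in this message, with contiguous CPU numbers per NUMA node.

```python
import re

# lstopo output copied from this message.
LSTOPO_TEXT = """\
Machine (24GB)
  NUMANode L#0 (P#0 12GB) + Socket L#0 + L3 L#0 (12MB)
    L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
    L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#1)
    L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#2)
    L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#3)
    L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#4)
    L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#5)
  NUMANode L#1 (P#1 12GB) + Socket L#1 + L3 L#1 (12MB)
    L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#6)
    L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7)
    L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8 + PU L#8 (P#8)
    L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9 + PU L#9 (P#9)
    L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10 + PU L#10 (P#10)
    L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11 + PU L#11 (P#11)
"""


def numa_layout(lstopo_text):
    """Map each NUMANode physical index to the PU physical indexes under it."""
    layout = {}
    current = None
    for line in lstopo_text.splitlines():
        node = re.search(r"NUMANode L#\d+ \(P#(\d+)", line)
        if node:
            current = int(node.group(1))
            layout[current] = []
        pu = re.search(r"PU L#\d+ \(P#(\d+)\)", line)
        if pu and current is not None:
            layout[current].append(int(pu.group(1)))
    return layout


def cpu_range(cpus):
    """Format a contiguous run of CPU numbers as 'first-last'."""
    cpus = sorted(cpus)
    if cpus != list(range(cpus[0], cpus[-1] + 1)):
        # Non-contiguous CPU sets are outside the scope of this sketch.
        raise ValueError("CPU numbers are not contiguous: %r" % cpus)
    return "%d-%d" % (cpus[0], cpus[-1])


def mom_layout_lines(lstopo_text):
    """Build one 'cpus=... mem=...' line per NUMA node that has CPUs."""
    layout = numa_layout(lstopo_text)
    return ["cpus=%s mem=%d" % (cpu_range(cpus), node)
            for node, cpus in sorted(layout.items()) if cpus]


if __name__ == "__main__":
    print("\n".join(mom_layout_lines(LSTOPO_TEXT)))
```

For the topology in this message the generated lines match the mom.layout shown above, apart from the amount of whitespace between the two fields.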

server_priv/nodes:
node11 num_numa_nodes=2 compute

I restarted pbs_server on the management node and pbs_mom on node11.
pbsnodes -a shows:

node11-0
     state = down
     np = 0
     properties = compute
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0

node11-1
     state = down
     np = 0
     properties = compute
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0
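
A rough sketch for spotting this condition without reading the listing by eye: it parses text in the layout of the pbsnodes -a output above and reports which entries are down. The helper names (parse_pbsnodes, down_nodes) are made up for illustration, and the parser assumes only what is visible here: an unindented node name followed by indented "key = value" lines. The sample is the output above, trimmed to a few attributes.

```python
# Excerpt of the pbsnodes -a output from this message.
PBSNODES_TEXT = """\
node11-0
     state = down
     np = 0
     properties = compute
     ntype = cluster

node11-1
     state = down
     np = 0
     properties = compute
     ntype = cluster
"""


def parse_pbsnodes(text):
    """Return {node_name: {attribute: value}} from pbsnodes-style text."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank line between node entries
        if not line[0].isspace():
            # An unindented line starts a new node entry.
            current = line.strip()
            nodes[current] = {}
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            nodes[current][key.strip()] = value.strip()
    return nodes


def down_nodes(nodes):
    """Names of nodes whose state list includes 'down'."""
    return sorted(
        name for name, attrs in nodes.items()
        if "down" in attrs.get("state", "").split(",")
    )


if __name__ == "__main__":
    parsed = parse_pbsnodes(PBSNODES_TEXT)
    for name in down_nodes(parsed):
        print(name, "is down, np =", parsed[name].get("np"))
```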


mom_log on node11 has:

04/17/2012 19:45:01;0002;   pbs_mom;Svr;Log;Log opened
04/17/2012 19:45:01;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.4, loglevel = 0
04/17/2012 19:45:01;0002;   pbs_mom;Svr;setpbsserver;mgt1
04/17/2012 19:45:01;0002;   pbs_mom;Svr;mom_server_add;server mgt1 added
04/17/2012 19:45:01;0002;   pbs_mom;Svr;setremchkptdirlist;added RemChkptDir[0] '/fastscratch/tmp'
04/17/2012 19:45:01;0002;   pbs_mom;Svr;settmpdir;/fastscratch/tmp
04/17/2012 19:45:01;0002;   pbs_mom;Svr;setloglevel;7
04/17/2012 19:45:01;0002;   pbs_mom;Svr;read_config;processing config line '$usecp *:/home /home
04/17/2012 19:45:01;0002;   pbs_mom;Svr;usecp;*:/home /home
04/17/2012 19:45:01;0002;   pbs_mom;Svr;read_config;processing config line '$usecp *:/scratch /scratch
04/17/2012 19:45:01;0002;   pbs_mom;Svr;usecp;*:/scratch /scratch
04/17/2012 19:45:01;0002;   pbs_mom;Svr;read_config;processing config line '$spool_as_final_name true
04/17/2012 19:45:01;0002;   pbs_mom;Svr;spoolasfinalname;true
04/17/2012 19:45:01;0002;   pbs_mom;n/a;initialize;independent
04/17/2012 19:45:01;0002;   pbs_mom;n/a;mom_open_poll;started
04/17/2012 19:45:01;0080;   pbs_mom;Svr;mom_get_sample;proc_array load started
04/17/2012 19:45:01;0080;   pbs_mom;n/a;mom_get_sample;proc_array loaded - nproc=202
04/17/2012 19:45:01;0080;   pbs_mom;Svr;pbs_mom;before init_abort_jobs
04/17/2012 19:45:01;0001;   pbs_mom;Svr;pbs_mom;init_abort_jobs: recover=2
04/17/2012 19:45:01;0002;   pbs_mom;Svr;pbs_mom;Is up
04/17/2012 19:45:01;0002;   pbs_mom;Svr;setup_program_environment;MOM executable path and mtime at launch: /usr/sbin/pbs_mom 1334684029
04/17/2012 19:45:01;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.4, loglevel = 7
04/17/2012 19:45:01;0002;   pbs_mom;Svr;pbs_mom;checking for old pbs_mom logs in dir '/var/spool/torque/mom_logs' (older than 1 days)
04/17/2012 19:45:01;0002;   pbs_mom;n/a;mom_server_open_stream;mom_server_open_stream: trying to open RPP conn to mgt1 port 15001
04/17/2012 19:45:01;0002;   pbs_mom;n/a;mom_server_open_stream;mom_server_open_stream: added connection to mgt1 port 15001
04/17/2012 19:45:01;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server mgt1
04/17/2012 19:45:01;0008;   pbs_mom;Svr;mom_server_send_hello;mom_server_send_hello
04/17/2012 19:45:01;0008;   pbs_mom;Svr;mom_server_send_hello;mom_server_send_hello done. Sent count = 1
04/17/2012 19:45:03;0008;   pbs_mom;Job;do_rpp;got an inter-server request
04/17/2012 19:45:03;0001;   pbs_mom;Job;is_request;stream 0 version 2
04/17/2012 19:45:03;0001;   pbs_mom;Job;is_request;command 2, "CLUSTER_ADDRS", received
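
To search a log like this for a particular step, such as anything mentioning the layout file, the semicolon-separated lines can be split into fields. This is only a sketch: the field labels (when, event, daemon, obj_class, source, message) are my own names for the six columns visible above, not official TORQUE terminology, and the sample is four lines copied from the log.

```python
from collections import namedtuple
from datetime import datetime

# Four lines copied from the mom_log excerpt in this message.
MOM_LOG = """\
04/17/2012 19:45:01;0002;   pbs_mom;Svr;setloglevel;7
04/17/2012 19:45:01;0002;   pbs_mom;Svr;read_config;processing config line '$usecp *:/home /home
04/17/2012 19:45:01;0002;   pbs_mom;Svr;pbs_mom;Is up
04/17/2012 19:45:01;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server mgt1
"""

# Field labels are illustrative names for the six columns.
LogRecord = namedtuple("LogRecord", "when event daemon obj_class source message")


def parse_line(line):
    """Split one log line into a LogRecord, or return None if it does not fit."""
    parts = line.split(";", 5)  # the message itself may contain semicolons
    if len(parts) != 6:
        return None
    when, event, daemon, obj_class, source, message = parts
    return LogRecord(
        datetime.strptime(when, "%m/%d/%Y %H:%M:%S"),
        event,
        daemon.strip(),
        obj_class,
        source,
        message,
    )


def parse_log(text):
    """Parse every well-formed line in the text."""
    return [r for r in map(parse_line, text.splitlines()) if r is not None]


def mentioning(records, needle):
    """Records whose source or message contains the given text (case-insensitive)."""
    needle = needle.lower()
    return [r for r in records
            if needle in r.source.lower() or needle in r.message.lower()]


if __name__ == "__main__":
    records = parse_log(MOM_LOG)
    print(len(records), "records parsed")
    print(len(mentioning(records, "layout")), "mention 'layout'")
```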

My problem, as the pbsnodes output above shows, is that node11 is down.  Running strace on the pbs_mom process also shows no access to the mom.layout file.

So did I really compile NUMA support?  I can see references to NUMA in the torque Makefile, and config.log definitely has the right parameters:

  $ ./configure --prefix=/usr --with-blcr=/usr --enable-numa-support --disable-gui --enable-blcr --with-default-server=mgt1 --with-servchkptdir=/fastscratch/tmp --with-tmpdir=/fastscratch/tmp

Can anyone provide further illumination on my already dark, dreary day?

Thanks,

Randall Svancara
High Performance Computing Systems Administrator
Washington State University


_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers



--
David Beer | Software Engineer
Adaptive Computing
