[torqueusers] NUMA -- A first try

David Beer dbeer at adaptivecomputing.com
Wed Apr 18 15:50:37 MDT 2012


There are no additional libraries that you need to supply.

On Wed, Apr 18, 2012 at 3:46 PM, Svancara, Randall <rsvancara at wsu.edu> wrote:

>  Ok, well this gives me a starting place at least.
>
> Are there additional libraries I need to supply?
>
> I build the software the following way:
>
> ./configure --prefix=/usr --with-blcr=/usr --enable-numa-support
> --disable-gui --enable-blcr --with-default-server=mgt1
> --with-servchkptdir=/fastscratch/tmp --with-tmpdir=/fastscratch/tmp
>
> make rpm
>
> I believe this is the relevant section out of the config.log:
>
>
> configure:22026: $? = 0
>
> configure:22029: test -s conftest.o
>
> configure:22032: $? = 0
>
> configure:22043: result: yes
>
> configure:22257: checking whether to allow geometry requests
>
> configure:22274: result: no
>
> configure:22285: checking whether to support NUMA systems
>
> configure:22288: result: yes
>
> configure:22313: checking whether to enable libcpuset support
>
> configure:22399: result: no
>
> configure:22407: checking whether to enable memacct support
>
> configure:22416: result: no
>
> configure:22510: checking whether add memory alignment flags
>
> configure:22517: result: no
>
> configure:22604: checking whether to build BLCR support
>
> configure:22606: result: yes
>
> configure:22627: checking for cr_init in -lcr
>
> configure:22657: gcc -o conftest -g -O2 -D_LARGEFILE64_SOURCE
> -DNUMA_SUPPORT  -L/usr/lib -lcr  conftest.c -lcr   >&5
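As an aside, checking/result pairs like the ones in this excerpt can be pulled out of a config.log mechanically. A small sketch (my own helper, not part of TORQUE or autoconf), run here against an abbreviated copy of the excerpt above:

```python
import re

def configure_results(log_text):
    """Pair each 'configure:N: checking ...' line with the next
    'configure:N: result: ...' line in a config.log excerpt."""
    results = {}
    pending = None
    for line in log_text.splitlines():
        check = re.match(r"configure:\d+: checking (.+)", line)
        if check:
            pending = check.group(1)
            continue
        result = re.match(r"configure:\d+: result: (\S+)", line)
        if result and pending is not None:
            results[pending] = result.group(1)
            pending = None
    return results

excerpt = """\
configure:22285: checking whether to support NUMA systems
configure:22288: result: yes
configure:22313: checking whether to enable libcpuset support
configure:22399: result: no
"""

print(configure_results(excerpt))
# → {'whether to support NUMA systems': 'yes',
#    'whether to enable libcpuset support': 'no'}
```

This only confirms what configure decided, of course, not what binary is actually installed and running.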
>
>
> Randall Svancara
>
> High Performance Computing Systems Administrator
>
> Washington State University
>
> 509-335-3039
>
> *From:* torqueusers-bounces at supercluster.org [mailto:
> torqueusers-bounces at supercluster.org] *On Behalf Of *David Beer
> *Sent:* Wednesday, April 18, 2012 2:14 PM
>
> *To:* Torque Users Mailing List
> *Subject:* Re: [torqueusers] NUMA -- A first try
>
> Randall,
>
> After looking closer at your logs, it appears that the pbs_mom binary
> wasn't NUMA-enabled. If it were, you'd have a message saying:
>
> Setting up this mom to function as %d numa nodes - in your case that %d
> would be a 2.
>
> or you'd have one of these error messages:
>
> Malformed mom.layout file, line:\n%s\n
>
> Unable to read the layout file in %s
>
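The messages David lists can be searched for mechanically in a mom_log. A quick sketch (my own, not part of TORQUE) using those message prefixes, run against a two-line sample like Randall's log below:

```python
# Message fragments quoted from David's email; if none of them appear in
# mom_log, the running pbs_mom likely was not built with NUMA support.
NUMA_MARKERS = (
    "Setting up this mom to function as",
    "Malformed mom.layout file",
    "Unable to read the layout file",
)

def numa_log_lines(log_text):
    """Return the mom_log lines that mention NUMA setup or mom.layout errors."""
    return [line for line in log_text.splitlines()
            if any(marker in line for marker in NUMA_MARKERS)]

sample = """\
04/17/2012 19:45:01;0002;   pbs_mom;Svr;Log;Log opened
04/17/2012 19:45:01;0002;   pbs_mom;Svr;pbs_mom;Is up
"""

print(numa_log_lines(sample))  # → [] : no NUMA messages, matching Randall's log
```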
>
> David
>
> On Wed, Apr 18, 2012 at 1:26 PM, Svancara, Randall <rsvancara at wsu.edu>
> wrote:
>
> Hey, good to know that I did do something correct.  I have validated that
> the mom.layout file is in /var/spool/torque/mom_priv/mom.layout.
>
>   4 -rw-r--r-- 1 root root    185 Apr 17 19:44 config
>
>   4 -rwxr-xr-x 1 root root    708 Apr  5  2011 epilogue
>
>   4 -rwxrwxrwx 1 root root    708 Apr  5  2011 epilogue.sh
>
>   0 drwxr-x--x 2 root root     40 Apr 17 10:33 jobs
>
>   4 -rwxr--r-- 1 root root     31 Apr 17 19:23 mom.layout
>
>   4 -rwxr--r-- 1 root root     50 Apr 17 19:20 mom.layout_bak
>
>   4 -rw-r--r-- 1 root root     32 Apr 17 15:26 mom.layout_old
>
>   4 -rw-r--r-- 1 root root      7 Apr 17 19:45 mom.lock
>
>   4 -rwxr-xr-x 1 root root    527 Apr 26  2011 prologue
>
>   4 -rwxrwxrwx 1 root root    527 Apr  5  2011 prologue.sh
>
>   4 -rwxr-xr-x 1 root root    203 Apr  5  2011 setperms.sh
>
> Thanks,
>
> Randall Svancara
>
> High Performance Computing Systems Administrator
>
> Washington State University
>
> 509-335-3039
>
>
> *From:* torqueusers-bounces at supercluster.org [mailto:
> torqueusers-bounces at supercluster.org] *On Behalf Of *David Beer
> *Sent:* Wednesday, April 18, 2012 12:16 PM
> *To:* Torque Users Mailing List
> *Subject:* Re: [torqueusers] NUMA -- A first try
>
> Randall,
>
> You did compile in NUMA support; you can tell because you get node11-0
> and node11-1 in your pbsnodes output. Is your mom.layout file in the
> correct location? It should be in mom_priv/mom.layout.
>
> David
>
> On Wed, Apr 18, 2012 at 9:44 AM, Svancara, Randall <rsvancara at wsu.edu>
> wrote:
>
> Hi,
>
> I have compiled torque 3.0.4 with NUMA support per this document:
>
> http://www.clusterresources.com/torquedocs21/1.7torqueonnuma.shtml
>
> I have created the server_priv/nodes and mom_priv/mom.layout files.
>
> Here are the versions of software:
>
>
> [root at node11 bin]# pbs_mom -v
>
> version: 3.0.4
>
> [root at mgt1 server_priv]# pbs_server -v
>
> version: 3.0.4
>
> lstopo shows:
>
> [root at node11 bin]# ./lstopo
>
> Machine (24GB)
>
>   NUMANode L#0 (P#0 12GB) + Socket L#0 + L3 L#0 (12MB)
>
>     L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
>
>     L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#1)
>
>     L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#2)
>
>     L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#3)
>
>     L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#4)
>
>     L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#5)
>
>   NUMANode L#1 (P#1 12GB) + Socket L#1 + L3 L#1 (12MB)
>
>     L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#6)
>
>     L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7)
>
>     L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8 + PU L#8 (P#8)
>
>     L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9 + PU L#9 (P#9)
>
>     L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10 + PU L#10 (P#10)
>
>     L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11 + PU L#11 (P#11)
>
>
> mom.layout:
>
> cpus=0-5        mem=0
>
> cpus=6-11       mem=1
>
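A mom.layout like this can be derived mechanically from lstopo text output. A hypothetical sketch (my own helper, not a TORQUE tool; it assumes each NUMANode's PUs are numbered contiguously, as in the lstopo output above, and its generated lines should be checked against the TORQUE NUMA docs before use):

```python
import re

def lstopo_to_mom_layout(lstopo_text):
    """Turn lstopo's 'NUMANode L#n (P#m ...)' / 'PU L#n (P#m)' lines into
    mom.layout entries of the form 'cpus=<min>-<max> mem=<node>'."""
    layout, node, cpus = [], None, []

    def flush():
        if node is not None and cpus:
            layout.append("cpus=%d-%d mem=%d" % (min(cpus), max(cpus), node))

    for line in lstopo_text.splitlines():
        m = re.search(r"NUMANode L#\d+ \(P#(\d+)", line)
        if m:
            flush()                      # emit the previous NUMA node
            node, cpus = int(m.group(1)), []
            continue
        p = re.search(r"PU L#\d+ \(P#(\d+)\)", line)
        if p:
            cpus.append(int(p.group(1)))
    flush()                              # emit the last NUMA node
    return layout

# Abbreviated version of the lstopo output above (first and last PU per node):
sample = """\
Machine (24GB)
  NUMANode L#0 (P#0 12GB) + Socket L#0 + L3 L#0 (12MB)
    L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
    L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#5)
  NUMANode L#1 (P#1 12GB) + Socket L#1 + L3 L#1 (12MB)
    L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#6)
    L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11 + PU L#11 (P#11)
"""

print(lstopo_to_mom_layout(sample))
# → ['cpus=0-5 mem=0', 'cpus=6-11 mem=1']
```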
>
> server_priv/nodes:
>
> node11 num_numa_nodes=2 compute
>
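The num_numa_nodes value here should agree with the number of entries in the node's mom.layout. A quick consistency check can be scripted; a hypothetical sketch (check_numa_counts is my own name, not a TORQUE utility):

```python
import re

def check_numa_counts(nodes_line, mom_layout_text):
    """True when num_numa_nodes in a server_priv/nodes line equals the
    number of non-blank entries in the node's mom.layout."""
    m = re.search(r"num_numa_nodes=(\d+)", nodes_line)
    declared = int(m.group(1)) if m else 0
    entries = [l for l in mom_layout_text.splitlines() if l.strip()]
    return declared == len(entries)

print(check_numa_counts("node11 num_numa_nodes=2 compute",
                        "cpus=0-5 mem=0\ncpus=6-11 mem=1\n"))  # → True
```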
>
> I restarted pbs_server on the management node and pbs_mom on node11.
>
> pbsnodes -a shows:
>
> node11-0
>
>      state = down
>
>      np = 0
>
>      properties = compute
>
>      ntype = cluster
>
>      mom_service_port = 15002
>
>      mom_manager_port = 15003
>
>      gpus = 0
>
> node11-1
>
>      state = down
>
>      np = 0
>
>      properties = compute
>
>      ntype = cluster
>
>      mom_service_port = 15002
>
>      mom_manager_port = 15003
>
>      gpus = 0
>
>
> mom_log on node11 has:
>
> 04/17/2012 19:45:01;0002;   pbs_mom;Svr;Log;Log opened
>
> 04/17/2012 19:45:01;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version =
> 3.0.4, loglevel = 0
>
> 04/17/2012 19:45:01;0002;   pbs_mom;Svr;setpbsserver;mgt1
>
> 04/17/2012 19:45:01;0002;   pbs_mom;Svr;mom_server_add;server mgt1 added
>
> 04/17/2012 19:45:01;0002;   pbs_mom;Svr;setremchkptdirlist;added
> RemChkptDir[0] '/fastscratch/tmp'
>
> 04/17/2012 19:45:01;0002;   pbs_mom;Svr;settmpdir;/fastscratch/tmp
>
> 04/17/2012 19:45:01;0002;   pbs_mom;Svr;setloglevel;7
>
> 04/17/2012 19:45:01;0002;   pbs_mom;Svr;read_config;processing config line
> '$usecp *:/home /home
>
> 04/17/2012 19:45:01;0002;   pbs_mom;Svr;usecp;*:/home /home
>
> 04/17/2012 19:45:01;0002;   pbs_mom;Svr;read_config;processing config line
> '$usecp *:/scratch /scratch
>
> 04/17/2012 19:45:01;0002;   pbs_mom;Svr;usecp;*:/scratch /scratch
>
> 04/17/2012 19:45:01;0002;   pbs_mom;Svr;read_config;processing config line
> '$spool_as_final_name true
>
> 04/17/2012 19:45:01;0002;   pbs_mom;Svr;spoolasfinalname;true
>
> 04/17/2012 19:45:01;0002;   pbs_mom;n/a;initialize;independent
>
> 04/17/2012 19:45:01;0002;   pbs_mom;n/a;mom_open_poll;started
>
> 04/17/2012 19:45:01;0080;   pbs_mom;Svr;mom_get_sample;proc_array load
> started
>
> 04/17/2012 19:45:01;0080;   pbs_mom;n/a;mom_get_sample;proc_array loaded -
> nproc=202
>
> 04/17/2012 19:45:01;0080;   pbs_mom;Svr;pbs_mom;before init_abort_jobs
>
> 04/17/2012 19:45:01;0001;   pbs_mom;Svr;pbs_mom;init_abort_jobs: recover=2
>
> 04/17/2012 19:45:01;0002;   pbs_mom;Svr;pbs_mom;Is up
>
> 04/17/2012 19:45:01;0002;   pbs_mom;Svr;setup_program_environment;MOM
> executable path and mtime at launch: /usr/sbin/pbs_mom 1334684029
>
> 04/17/2012 19:45:01;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version =
> 3.0.4, loglevel = 7
>
> 04/17/2012 19:45:01;0002;   pbs_mom;Svr;pbs_mom;checking for old pbs_mom
> logs in dir '/var/spool/torque/mom_logs' (older than 1 days)
>
> 04/17/2012 19:45:01;0002;
> pbs_mom;n/a;mom_server_open_stream;mom_server_open_stream: trying to open
> RPP conn to mgt1 port 15001
>
> 04/17/2012 19:45:01;0002;
> pbs_mom;n/a;mom_server_open_stream;mom_server_open_stream: added connection
> to mgt1 port 15001
>
> 04/17/2012 19:45:01;0002;
> pbs_mom;n/a;mom_server_check_connection;sending hello to server mgt1
>
> 04/17/2012 19:45:01;0008;
> pbs_mom;Svr;mom_server_send_hello;mom_server_send_hello
>
> 04/17/2012 19:45:01;0008;
> pbs_mom;Svr;mom_server_send_hello;mom_server_send_hello done. Sent count = 1
>
> 04/17/2012 19:45:03;0008;   pbs_mom;Job;do_rpp;got an inter-server request
>
> 04/17/2012 19:45:03;0001;   pbs_mom;Job;is_request;stream 0 version 2
>
> 04/17/2012 19:45:03;0001;   pbs_mom;Job;is_request;command 2,
> "CLUSTER_ADDRS", received
>
>
> My problem, as illustrated by the pbsnodes output above, is that node11
> is down, and running strace on the pbs_mom process does not show any
> access to the mom.layout file.
>
> So did I really compile NUMA support?  I can see references to NUMA in the
> Makefile for torque, and the config.log definitely has the right parameters:
>
>   $ ./configure --prefix=/usr --with-blcr=/usr --enable-numa-support
> --disable-gui --enable-blcr --with-default-server=mgt1
> --with-servchkptdir=/fastscratch/tmp --with-tmpdir=/fastscratch/tmp
>
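Besides strace, another rough way to test whether NUMA support made it into the installed binary is to look for the mom.layout-related message strings in the raw bytes of /usr/sbin/pbs_mom, much like running `strings` on it. This rests on an assumption of mine (that those strings are compiled in only with --enable-numa-support), so treat it as a hint, not proof:

```python
def binary_mentions(path, needle=b"mom.layout"):
    """Return True if the raw bytes of the file at `path` contain `needle`.

    A crude stand-in for `strings <path> | grep mom.layout`; reads the
    whole file, which is fine for a binary the size of pbs_mom.
    """
    with open(path, "rb") as f:
        return needle in f.read()

# e.g. on node11: binary_mentions("/usr/sbin/pbs_mom")
```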
>
> Can anyone provide further illumination on my already dark, dreary day?
>
> Thanks,
>
> Randall Svancara
>
> High Performance Computing Systems Administrator
>
> Washington State University
>
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
> --
>
> David Beer | Software Engineer
>
> Adaptive Computing
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
>


-- 
David Beer | Software Engineer
Adaptive Computing

