[torqueusers] Torque on NUMA Systems

David Beer dbeer at adaptivecomputing.com
Mon Sep 16 11:44:53 MDT 2013


The confusion appears to be in the mom.layout file. As of the 4.x series of
the code, you no longer place cpus and mems in the layout file; each line is
simply the node index (as defined by the OS). To translate your file, take
the value of mem= and make it nodes=. For example:

cpus=0-7 mem=0
cpus=8-15 mem=1

becomes

nodes=0
nodes=1

and so on.
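
If the file is long, the translation above can be scripted. This is only a
minimal sketch, not something shipped with Torque: the function name
convert_layout is hypothetical, and it assumes every non-blank old-style line
has exactly the form "cpus=<range> mem=<index>".

```python
import re

# Matches an old-style line such as "cpus=0-7 mem=0" and captures the mem= value.
# Assumption: old-style lines contain only a cpus= field followed by a mem= field.
_OLD_LINE = re.compile(r"^\s*cpus=\S+\s+mem=(\S+)\s*$")


def convert_layout(old_text: str) -> str:
    """Rewrite old-style mom.layout text as one nodes=<index> line per entry."""
    new_lines = []
    for line in old_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        match = _OLD_LINE.match(line)
        if match is None:
            # Refuse to guess at lines that do not fit the assumed format.
            raise ValueError(f"unrecognized layout line: {line!r}")
        new_lines.append(f"nodes={match.group(1)}")
    return "".join(f"{entry}\n" for entry in new_lines)


if __name__ == "__main__":
    old = "cpus=0-7 mem=0\ncpus=8-15 mem=1\n"
    print(convert_layout(old), end="")
```

Read the existing mom.layout, pass its contents to convert_layout, and write
the result back after checking it by hand; the sketch raises an error rather
than silently dropping a line it does not recognize.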

I hope this helps.

David


On Mon, Sep 16, 2013 at 8:19 AM, Alison Barros <alisonbarros at gmail.com> wrote:

>  Hello Guys,
>
> I have a SGI ICEX Cluster System running torque perfectly and now I'm
> responsible to implement torque on a SGI UV2000 System using NUMA
> configuration on SLES 11, but I'm having some trouble. I hope somebody can
> help me.
>
>  *Hardware specification:*
>
> The topology command says:
>
>  System type: UV2000
>
> System name: lanina
>
> Serial number: UV2-00000003
>
> Partition number: 0
>
> *48 Blades*
>
> *1536 CPUs*
>
> *96 Nodes*
>
> 2933.29 GB Memory Total
>
> 31.00 GB Max Memory on any Node
>
>  1 BASE I/O Riser
>
>  2 PCIe Slots
>
>  2 Fibre Channel Controllers
>
>  1 InfiniBand Controller
>
>  2 Network Controllers
>
>  2 Storage Controllers
>
>  2 USB Controllers
>
>  1 VGA GPU
>
>
> Although the topology command reports 1536 CPUs available, there are
> only 768 with Hyper-Threading enabled.
>
>
> *Compile step:*
>
>  I created the rpm packages by editing torque.spec to include the
> *--enable-numa-support* and *--enable-cpuset* flags.
>
>  After installing the packages on the system, I verified the flags with
> the pbs_server --about command:
>
>
>  *lanina:~ # pbs_server --about*
>
> Package: torque 4.2.3.1
>
> Sourcedir: /usr/src/packages/BUILD/torque-4.2.3.1
>
> Configure: '--host=x86_64-suse-linux-gnu' '--build=x86_64-suse-linux-gnu'
> '--target=x86_64-suse-linux' '--program-prefix=' '--prefix=/usr'
> '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin'
> '--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include'
> '--libdir=/usr/lib64' '--libexecdir=/usr/lib64' '--localstatedir=/var'
> '--sharedstatedir=/usr/com' '--mandir=/usr/share/man'
> '--infodir=/usr/share/info' '--includedir=/usr/include/torque'
> '--with-default-server=lanina' '--with-server-home=/var/spool/torque'
> '--without-debug' 'CFLAGS=-O0 -g3' '--disable-libcpuset'
> '--with-sendmail=/usr/sbin/sendmail' '--enable-numa-support' '--enable-memacct'
> '--disable-top-tempdir-only' '--disable-dependency-tracking'
> '--disable-gui' '--without-tcl' '--with-rcp=scp' '--enable-syslog'
> '--disable-gcc-warnings' '--disable-munge-auth' '--without-pam'
> '--disable-drmaa' '--disable-qsub-keep-override' '--disable-blcr' '--enable-cpuset'
> '--enable-spool' '--with-hwloc-path=/usr/include/hwloc' 'build_alias=x86_64-suse-linux-gnu' 'host_alias=x86_64-suse-linux-gnu'
> 'target_alias=x86_64-suse-linux' 'CXXFLAGS=-O2 -g -m64 -fmessage-length=0
> -D_FORTIFY_SOURCE=2 -fstack-protector -funwind-tables
> -fasynchronous-unwind-tables'
>
> *Buildcflags: -O0 -g3 -DNUMA_SUPPORT -I/usr/include/hwloc/include*
>
>  *Configuration:*
>
> Following the manual, I used the /sys/devices/system/node directory to
> create the /var/spool/mom_priv/mom.layout file on the client side:
>
> cpus=0-7 mem=0
> cpus=8-15 mem=1
> cpus=16-23 mem=2
> cpus=24-31 mem=3
> cpus=32-39 mem=4
> ...
> cpus=744-751 mem=93
> cpus=752-759 mem=94
> cpus=760-767 mem=95
>
>  On the server side I created the /var/spool/torque/server_priv/nodes
> file with the following content:
>
> *lanina np=768 num_numa_nodes=96*
>
>  *Results:*
>
>  The log while starting the server in debug mode shows:
>
>  09/13/2013 10:55:44;0002;PBS_Server.588556;Svr;Log;Log opened
>
> 09/13/2013 10:55:44;0006;PBS_Server.588556;Svr;PBS_Server;Server
> 'lanina' started, initialization type = 1
>
> 09/13/2013
> 10:55:44;0002;PBS_Server.588556;Svr;get_default_threads;Defaulting
> min_threads to 3073 threads
>
> 09/13/2013 10:55:44;0002;PBS_Server.588556;Svr;Act;Account file
> /var/spool/torque/server_priv/accounting/20130913 opened
>
> 09/13/2013 10:55:44;0040;PBS_Server.588556;Req;setup_nodes;setup_nodes()
>
> 09/13/2013 10:55:44;0040;PBS_Server.588556;Req;setup_nodes;could not
> create node "lanina", error = 15002
>
> 09/13/2013 10:55:44;0086;PBS_Server.588556;Svr;PBS_Server;Recovered queue
> batch
>
> 09/13/2013 10:55:44;0086;PBS_Server.588556;Svr;PBS_Server;Recovered queue
> pesquisa
>
> 09/13/2013 10:55:44;0086;PBS_Server.588556;Svr;PBS_Server;Recovered queue
> operacional
>
> 09/13/2013 10:55:44;0002;PBS_Server.588556;Svr;PBS_Server;Expected 3,
> recovered 3 queues
>
> 09/13/2013 10:55:44;0080;PBS_Server.588556;Svr;PBS_Server;2 total files
> read from disk
>
> 09/13/2013
> 10:55:44;0002;PBS_Server.588556;Svr;PBS_Server;handle_job_recovery:3
>
> 09/13/2013 10:55:44;0006;PBS_Server.588556;Svr;PBS_Server;Using ports
> Server:15001 Scheduler:15004 MOM:15002 (server: 'lanina')
>
> 09/13/2013 10:55:44;0002;PBS_Server.588556;Svr;PBS_Server;Server Ready,
> pid = 588556, loglevel=0
>
> 09/13/2013 10:55:55;0002;PBS_Server.588560;Svr;PBS_Server;Torque Server
> Version = 4.2.3.1, loglevel = 0
>
>
>  The log while starting the client in debug mode shows:
>
>   09/13/2013 10:55:49;0002; pbs_mom.588562;Svr;Log;Log opened
>
> 09/13/2013 10:55:49;0002; pbs_mom.588562;Svr;pbs_mom;Torque Mom Version =
> 4.2.3.1, loglevel = 0
>
> 09/13/2013 10:55:50;0002;
> pbs_mom.588562;Svr;setup_program_environment;machine topology contains 96
> memory nodes, 1536 cpus
>
> 09/13/2013 10:55:50;0001;
> pbs_mom.588562;Svr;pbs_mom;LOG_ERROR::read_layout_file, nodeboard 0 has no
> nodeset
>
> 09/13/2013 10:55:50;0001;
> pbs_mom.588562;Svr;pbs_mom;LOG_ERROR::setup_nodeboards, Could not read
> layout file!
>
>
>  It looks as if Torque cannot find the mom.layout file, but when I start
> the client under strace I can see it opening and reading the file.
>
>  Any help? Thanks in advance.
>
> --
> Att.
> MSc. Alison Barros da Silva
>
>


-- 
David Beer | Senior Software Engineer
Adaptive Computing