[torqueusers] Torque on NUMA Systems

Alison Barros alisonbarros at gmail.com
Mon Sep 16 12:43:31 MDT 2013


It works!

Thank you very much.


2013/9/16 David Beer <dbeer at adaptivecomputing.com>

> The confusion appears to be in the mom.layout file. As of the 4.x series of
> code, you no longer place cpus and mems in the layout file; each line is
> simply the node index (as defined by the OS). To translate your file, take
> the value for mem= and make it nodes=. For example:
>
> cpus=0-7 mem=0
> cpus=8-15 mem=1
>
> becomes
>
> nodes=0
> nodes=1
>
> and so on.
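
For a layout file with many entries, the translation described above could be scripted. This is only a minimal sketch and not part of Torque: the function name translate_layout is hypothetical, and it assumes every old-style line has exactly the form "cpus=<range> mem=<index>".

```python
import re

# Hypothetical helper, not part of Torque. It assumes each old-style line
# looks like "cpus=<range> mem=<index>" and keeps only the mem index.
OLD_LINE = re.compile(r"^\s*cpus=\S+\s+mem=(\S+)\s*$")

def translate_layout(text):
    """Rewrite old-style layout lines as 'nodes=<index>' lines."""
    out = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        match = OLD_LINE.match(line)
        if match is None:
            # Refuse to guess at lines that do not fit the assumed form.
            raise ValueError(f"unrecognized layout line: {line!r}")
        out.append(f"nodes={match.group(1)}\n")
    return "".join(out)

if __name__ == "__main__":
    print(translate_layout("cpus=0-7 mem=0\ncpus=8-15 mem=1\n"), end="")
```

The sketch raises an error on any line it does not recognize, so a partly converted file is never written silently.
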
>
> I hope this helps.
>
> David
>
>
> On Mon, Sep 16, 2013 at 8:19 AM, Alison Barros <alisonbarros at gmail.com> wrote:
>
>> Hello guys,
>>
>> I have an SGI ICEX cluster running Torque perfectly, and I am now
>> responsible for deploying Torque on an SGI UV2000 system with a NUMA
>> configuration on SLES 11, but I'm having some trouble. I hope somebody can
>> help me.
>>
>> * Hardware specification:
>>
>> The topology command says:
>>
>> System type: UV2000
>> System name: lanina
>> Serial number: UV2-00000003
>> Partition number: 0
>> *48 Blades*
>> *1536 CPUs*
>> *96 Nodes*
>> 2933.29 GB Memory Total
>> 31.00 GB Max Memory on any Node
>> 1 BASE I/O Riser
>> 2 PCIe Slots
>> 2 Fibre Channel Controllers
>> 1 InfiniBand Controller
>> 2 Network Controllers
>> 2 Storage Controllers
>> 2 USB Controllers
>> 1 VGA GPU
>>
>>
>> Although the topology command reports 1536 CPUs, that count includes
>> Hyper-Threading; there are only 768 physical cores.
>>
>>
>> * Build step:
>>
>> I built the RPM packages after editing torque.spec to include the
>> --enable-numa-support and --enable-cpuset flags.
>>
>> After installing the packages on the system, I verified the configure
>> flags with the pbs_server --about command:
>>
>>
>>  *lanina:~ # pbs_server --about*
>>
>> Package: torque 4.2.3.1
>>
>> Sourcedir: /usr/src/packages/BUILD/torque-4.2.3.1
>>
>> Configure: '--host=x86_64-suse-linux-gnu' '--build=x86_64-suse-linux-gnu'
>> '--target=x86_64-suse-linux' '--program-prefix=' '--prefix=/usr'
>> '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin'
>> '--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include'
>> '--libdir=/usr/lib64' '--libexecdir=/usr/lib64' '--localstatedir=/var'
>> '--sharedstatedir=/usr/com' '--mandir=/usr/share/man'
>> '--infodir=/usr/share/info' '--includedir=/usr/include/torque'
>> '--with-default-server=lanina' '--with-server-home=/var/spool/torque'
>> '--without-debug' 'CFLAGS=-O0 -g3' '--disable-libcpuset'
>> '--with-sendmail=/usr/sbin/sendmail' '--enable-numa-support' '--enable-memacct'
>> '--disable-top-tempdir-only' '--disable-dependency-tracking'
>> '--disable-gui' '--without-tcl' '--with-rcp=scp' '--enable-syslog'
>> '--disable-gcc-warnings' '--disable-munge-auth' '--without-pam'
>> '--disable-drmaa' '--disable-qsub-keep-override' '--disable-blcr' '--enable-cpuset'
>> '--enable-spool' '--with-hwloc-path=/usr/include/hwloc' 'build_alias=x86_64-suse-linux-gnu' 'host_alias=x86_64-suse-linux-gnu'
>> 'target_alias=x86_64-suse-linux' 'CXXFLAGS=-O2 -g -m64 -fmessage-length=0
>> -D_FORTIFY_SOURCE=2 -fstack-protector -funwind-tables
>> -fasynchronous-unwind-tables'
>>
>> *Buildcflags: -O0 -g3 -DNUMA_SUPPORT -I/usr/include/hwloc/include*
>>
>> * Configuration:
>>
>> Following the manual, I used the /sys/devices/system/node directory to
>> create the /var/spool/mom_priv/mom.layout file on the client side:
>>
>> cpus=0-7 mem=0
>> cpus=8-15 mem=1
>> cpus=16-23 mem=2
>> cpus=24-31 mem=3
>> cpus=32-39 mem=4
>> ...
>> cpus=744-751 mem=93
>> cpus=752-759 mem=94
>> cpus=760-767 mem=95
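
According to the reply quoted earlier in this thread, the 4.x layout file needs only the OS node index on each line, so the file could in principle be generated from the nodeN entries under /sys/devices/system/node. The following is a sketch under that assumption, not a Torque tool: the function name layout_from_sysfs is hypothetical, and the one-line-per-node "nodes=<index>" format is taken from that reply.

```python
import os
import re

def layout_from_sysfs(node_dir="/sys/devices/system/node"):
    """Return one 'nodes=<index>' line per nodeN entry found in node_dir."""
    indices = []
    for name in os.listdir(node_dir):
        # Only entries named node<digits> are NUMA nodes; skip the rest.
        match = re.fullmatch(r"node(\d+)", name)
        if match:
            indices.append(int(match.group(1)))
    # Sort numerically so node10 comes after node9, not after node1.
    return "".join(f"nodes={i}\n" for i in sorted(indices))

if __name__ == "__main__":
    default_dir = "/sys/devices/system/node"
    if os.path.isdir(default_dir):
        print(layout_from_sysfs(default_dir), end="")
```

The output is printed rather than written to mom.layout, so it can be reviewed before it replaces the existing file.
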
>>
>>  On the server side I created the /var/spool/torque/server_priv/nodes
>> file with the following content:
>>
>> *lanina np=768 num_numa_nodes=96*
>>
>> * Results:
>>
>> The server log, when started in debug mode, shows:
>>
>> 09/13/2013 10:55:44;0002;PBS_Server.588556;Svr;Log;Log opened
>> 09/13/2013 10:55:44;0006;PBS_Server.588556;Svr;PBS_Server;Server 'lanina' started, initialization type = 1
>> 09/13/2013 10:55:44;0002;PBS_Server.588556;Svr;get_default_threads;Defaulting min_threads to 3073 threads
>> 09/13/2013 10:55:44;0002;PBS_Server.588556;Svr;Act;Account file /var/spool/torque/server_priv/accounting/20130913 opened
>> 09/13/2013 10:55:44;0040;PBS_Server.588556;Req;setup_nodes;setup_nodes()
>> 09/13/2013 10:55:44;0040;PBS_Server.588556;Req;setup_nodes;could not create node "lanina", error = 15002
>> 09/13/2013 10:55:44;0086;PBS_Server.588556;Svr;PBS_Server;Recovered queue batch
>> 09/13/2013 10:55:44;0086;PBS_Server.588556;Svr;PBS_Server;Recovered queue pesquisa
>> 09/13/2013 10:55:44;0086;PBS_Server.588556;Svr;PBS_Server;Recovered queue operacional
>> 09/13/2013 10:55:44;0002;PBS_Server.588556;Svr;PBS_Server;Expected 3, recovered 3 queues
>> 09/13/2013 10:55:44;0080;PBS_Server.588556;Svr;PBS_Server;2 total files read from disk
>> 09/13/2013 10:55:44;0002;PBS_Server.588556;Svr;PBS_Server;handle_job_recovery:3
>> 09/13/2013 10:55:44;0006;PBS_Server.588556;Svr;PBS_Server;Using ports Server:15001 Scheduler:15004 MOM:15002 (server: 'lanina')
>> 09/13/2013 10:55:44;0002;PBS_Server.588556;Svr;PBS_Server;Server Ready, pid = 588556, loglevel=0
>> 09/13/2013 10:55:55;0002;PBS_Server.588560;Svr;PBS_Server;Torque Server Version = 4.2.3.1, loglevel = 0
>>
>>
>> The client (pbs_mom) log, when started in debug mode, shows:
>>
>> 09/13/2013 10:55:49;0002; pbs_mom.588562;Svr;Log;Log opened
>> 09/13/2013 10:55:49;0002; pbs_mom.588562;Svr;pbs_mom;Torque Mom Version = 4.2.3.1, loglevel = 0
>> 09/13/2013 10:55:50;0002; pbs_mom.588562;Svr;setup_program_environment;machine topology contains 96 memory nodes, 1536 cpus
>> 09/13/2013 10:55:50;0001; pbs_mom.588562;Svr;pbs_mom;LOG_ERROR::read_layout_file, nodeboard 0 has no nodeset
>> 09/13/2013 10:55:50;0001; pbs_mom.588562;Svr;pbs_mom;LOG_ERROR::setup_nodeboards, Could not read layout file!
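
When the logs are longer than these excerpts, the error records can be pulled out mechanically. The following is a rough sketch, not a Torque utility: the function name find_errors is hypothetical, and it assumes each record sits on one line with six semicolon-separated fields, as in the excerpts above, with the message in the last field.

```python
def find_errors(log_text):
    """Return (timestamp, message) pairs for records mentioning LOG_ERROR."""
    errors = []
    for line in log_text.splitlines():
        # Split into at most six fields so semicolons inside the message survive.
        parts = line.split(";", 5)
        if len(parts) != 6:
            continue  # not a complete record (for example, a wrapped line)
        timestamp = parts[0].strip()
        message = parts[5].strip()
        if "LOG_ERROR" in message:
            errors.append((timestamp, message))
    return errors

if __name__ == "__main__":
    sample = (
        "09/13/2013 10:55:49;0002; pbs_mom.588562;Svr;Log;Log opened\n"
        "09/13/2013 10:55:50;0001; pbs_mom.588562;Svr;pbs_mom;"
        "LOG_ERROR::read_layout_file, nodeboard 0 has no nodeset\n"
    )
    for timestamp, message in find_errors(sample):
        print(timestamp, message)
```

Records that were wrapped across several lines by a mail client would need to be rejoined first, since the sketch skips any line that does not have all six fields.
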
>>
>>
>> It looks as if Torque cannot find the mom.layout file, but when I start
>> pbs_mom under strace I can see it opening and reading the file.
>>
>>  Any help? Thanks in advance.
>>
>> --
>> Regards,
>> MSc. Alison Barros da Silva
>>
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>
>>
>
>
> --
> David Beer | Senior Software Engineer
> Adaptive Computing
>


-- 
Regards,
MSc. Alison Barros da Silva
Computer Engineer