[torqueusers] Help with NUMA support
David Beer
dbeer at adaptivecomputing.com
Mon Nov 28 14:53:04 MST 2011
----- Original Message -----
> A colleague and I are trying to reconfigure a Linux system with TORQUE
> NUMA support. Here are some details of the system:
>
> 1. 48 processors: 'lstopo' output gives 8 NUMA nodes, 6 cores/node.
>
> 2. Debian Linux running the 2.6.32-5-amd64 kernel.
>
> 3. Open MPI 1.5.3, configured with 'libnuma' support.
>
> We previously had TORQUE successfully configured and running without
> NUMA support, but this wasn't satisfactory for running multiple MPI jobs
> concurrently. Here are the steps we have taken:
>
> 1. Reconfigured TORQUE with --enable-numa-support
>
> 2. Created 'mom.layout' in /var/spool/torque/mom_priv with:
>
> cpus=0-5 mem=0
> cpus=6-11 mem=1
> cpus=12-17 mem=2
> cpus=18-23 mem=3
> cpus=24-29 mem=4
> cpus=30-35 mem=5
> cpus=36-41 mem=6
> cpus=42-47 mem=7
>
> based on the 'lstopo' output.
>
> 3. Created 'nodes' file in /var/spool/torque/server_priv with:
>
> notus np=48 num_numa_nodes=8
>
> where 'notus' is the host name.
>
>
> 4. Restarted 'pbs_mom', 'pbs_sched', and 'pbs_server'.
>
>
> 5. Submitted MPI jobs with, e.g., '-l nodes=4:ppn=6' for PBS resources
> and 'mpirun -np 24' for MPI.
>
>
> With this we are getting the following error messages in the
> 'sched_logs' file:
>
> 11/28/2011 12:10:18;0040; pbs_sched;Job;10.notus.nrl.navy.mil;Not enough of the right type of nodes available
> 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-0;Can not open connection to mom
> 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-1;Can not open connection to mom
> 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-2;Can not open connection to mom
> 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-3;Can not open connection to mom
> 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-4;Can not open connection to mom
> 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-5;Can not open connection to mom
> 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-6;Can not open connection to mom
> 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-7;Can not open connection to mom
>
>
> What are we missing? Any suggestions or advice?
>
> T. Rosmond
>
Tom,
Are you running pbs_sched? pbs_sched has not been updated to support NUMA scheduling, and there are currently no plans to do so. This must be disappointing, but I'm sure you understand that we cannot do development that competes with the scheduler our company sells.
I'm afraid you're going to need to purchase a scheduler that can handle this kind of hardware.
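
As an aside on the mom.layout quoted above: its lines follow a regular pattern (one line per NUMA node, six consecutive cores each), so a file like it can be generated rather than typed by hand. The following is only a minimal sketch in Python. The function name mom_layout_lines is hypothetical, and the sketch assumes core IDs are numbered contiguously within each NUMA node, which should be checked against the 'lstopo' output because some machines interleave core numbering across nodes.

```python
def mom_layout_lines(num_nodes, cores_per_node):
    """Build one 'cpus=A-B mem=N' line per NUMA node.

    Assumption (hypothetical helper, not part of TORQUE): node N owns the
    contiguous core range N*cores_per_node .. (N+1)*cores_per_node - 1.
    """
    lines = []
    for node in range(num_nodes):
        first = node * cores_per_node
        last = first + cores_per_node - 1
        lines.append(f"cpus={first}-{last} mem={node}")
    return lines


if __name__ == "__main__":
    # 8 NUMA nodes with 6 cores each, as reported in the original message.
    print("\n".join(mom_layout_lines(8, 6)))
```

With 8 nodes of 6 cores this reproduces the eight lines listed in the original message; the result would still need to be written to mom_priv/mom.layout by hand or by a wrapper script.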
--
David Beer
Direct Line: 801-717-3386 | Fax: 801-717-3738
Adaptive Computing
1712 S East Bay Blvd, Suite 300
Provo, UT 84606