[torqueusers] Help with NUMA support

David Beer dbeer at adaptivecomputing.com
Mon Nov 28 14:53:04 MST 2011



----- Original Message -----
> A colleague and I are trying to reconfigure a Linux system with TORQUE
> NUMA support.  Here are some details of the system:
> 
> 1. 48 processors: 'lstopo' output gives 8 NUMA nodes, 6 cores/node.
> 
> 2. Debian Linux running the 2.6.32-5-amd64 kernel.
> 
> 3. Open MPI 1.5.3, configured with 'libnuma' support.
> 
> We previously had TORQUE successfully configured and running without
> NUMA support, but this wasn't satisfactory for running multiple MPI jobs
> concurrently.  Here are the steps we have taken:
> 
> 1. Reconfigured TORQUE with --enable-numa-support
> 
> 2. Created 'mom.layout' in /var/spool/torque/mom_priv with:
> 
> cpus=0-5     mem=0
> cpus=6-11    mem=1
> cpus=12-17   mem=2
> cpus=18-23   mem=3
> cpus=24-29   mem=4
> cpus=30-35   mem=5
> cpus=36-41   mem=6
> cpus=42-47   mem=7
> 
> based on the 'lstopo' output.
> 
> 3. Created 'nodes' file in /var/spool/torque/server_priv with:
> 
> notus  np=48 num_numa_nodes=8
> 
> where 'notus' is the host name.
> 
> 
> 4. Restarted 'pbs_mom', 'pbs_sched', and 'pbs_server'.
> 
> 
> 5. Submitted MPI jobs with, e.g., '-l nodes=4:ppn=6' for PBS resources
> and 'mpirun -np 24' for MPI.
> 
> 
> With this we are getting the following error messages in the
> 'sched_logs' file:
> 
> 11/28/2011 12:10:18;0040; pbs_sched;Job;10.notus.nrl.navy.mil;Not enough of the right type of nodes available
> 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-0;Can not open connection to mom
> 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-1;Can not open connection to mom
> 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-2;Can not open connection to mom
> 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-3;Can not open connection to mom
> 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-4;Can not open connection to mom
> 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-5;Can not open connection to mom
> 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-6;Can not open connection to mom
> 11/28/2011 12:20:18;0002; pbs_sched;Req;notus-7;Can not open connection to mom
> 
> 
> What are we missing?  Any suggestions or advice?
> 
> T. Rosmond
> 

Tom, 

Are you running pbs_sched? pbs_sched has not been updated to support NUMA scheduling, and there are currently no plans to add that support. This must be disappointing, but I'm sure you understand that we cannot do development that competes with the scheduler our company sells.

I'm afraid you're going to need to purchase a scheduler that can handle this kind of hardware.
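
For reference, the 'mom.layout' quoted above and the notus-0 through notus-7 names in the scheduler log follow one regular pattern: one layout line and one per-NUMA-node name for each of the 8 NUMA nodes. Below is a minimal Python sketch of that pattern. It assumes cores are numbered contiguously within each NUMA node, as in the layout above; on other hardware the numbering may differ, so check 'lstopo' first. The function name numa_layout is hypothetical and is not part of TORQUE.

```python
def numa_layout(host, num_nodes, cores_per_node):
    """Return (mom.layout lines, per-NUMA-node names) for a uniform host.

    Assumption: cores are numbered contiguously within each NUMA node.
    """
    lines, names = [], []
    for node in range(num_nodes):
        first = node * cores_per_node
        last = first + cores_per_node - 1
        # One mom.layout line per NUMA node: its cores and its memory node.
        lines.append(f"cpus={first}-{last} mem={node}")
        # Per-node name in the form seen in the scheduler log (e.g. notus-0).
        names.append(f"{host}-{node}")
    return lines, names


if __name__ == "__main__":
    layout, node_names = numa_layout("notus", 8, 6)
    print("\n".join(layout))
    print(" ".join(node_names))
```

Comparing the printed names against the 'Can not open connection to mom' entries shows that pbs_sched is trying to contact each per-NUMA-node name as if it were a separate host.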

-- 
David Beer 
Direct Line: 801-717-3386 | Fax: 801-717-3738
     Adaptive Computing
     1712 S East Bay Blvd, Suite 300
     Provo, UT 84606


