[torqueusers] Problem with TM interface when using --enable-numa

Gareth.Williams at csiro.au
Fri Mar 30 19:18:11 MDT 2012


> -----Original Message-----
> From: Lukasz Flis [mailto:l.flis at cyf-kr.edu.pl]
> Sent: Wednesday, 28 March 2012 10:00 AM
> To: Torque Developers mailing list; Torque Users Mailing List
> Subject: [torqueusers] Problem with TM interface when using --enable-numa
> 
> Hi
> 
> It seems that the TM interface in Torque 3.0.4 compiled with the
> --enable-numa flag is broken.
> 
> Example:
> 
> qsub -I -l nodes=4:ppn=1
> qsub: waiting for job 307.batch-xsmp to start
> qsub: job 307.batch-xsmp ready
> 
> 
> [@xsmp4-3-1 ~]$ cat $PBS_NODEFILE
> xsmp4-3-1.local
> xsmp4-2-4.local
> xsmp4-2-3.local
> xsmp4-1-2.local
> 
> #mpiexec from openmpi compiled with TM support
> mpiexec uname -n
> xsmp4-3-1.local
> xsmp4-3-1.local
> xsmp4-3-1.local
> xsmp4-3-1.local
> 
> 
> The job above had been allocated 4 different nodes.
> However, mpiexec or pbsdsh runs the given command 4 times on the first
> host listed in $PBS_NODEFILE.
> 
> Is this desired behaviour? I haven't tested Torque 4.0 with numa but I
> suspect it could have the same problem.
> 
> Cheers
> --
> LKF

I see different behaviour with our 3.0.4-snap.201201051014 numa-enabled setup, and I think I can see the problem, or at least the difference.

Our numa setup has a single host, cherax, with a set of logical numa-nodes cherax-0, cherax-1, ...
In jobs, $PBS_NODEFILE is populated with the actual hostname, cherax, rather than the logical numa-node names, and pbsdsh works fine as far as I know, though I haven't checked whether launched processes are bound to the appropriate cores (I wouldn't necessarily expect that anyway).
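
For reference, a single-host numa configuration of this kind is typically set up roughly as sketched below; the hostname, core count and board count are only placeholders, not our actual values:

  # server_priv/nodes - one physical host presenting several numa node boards
  cherax np=64 num_node_boards=8

  # mom_priv/mom.layout - one entry per logical numa-node (cherax-0, cherax-1, ...),
  # matching the NUMA nodes the OS reports under /sys/devices/system/node
  nodes=0
  nodes=1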

It looks like you actually have a multi-node system, and I had feedback some time ago that a numa-enabled Torque could not (yet) be run on such a system. Your 'uname -n' output seems to indicate that you are not really getting a numa setup.
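
If it helps, one quick way to check whether the numa build is actually in effect is to look at what pbsnodes reports - with a working --enable-numa configuration the logical numa-node names should be listed rather than the plain hostname. Roughly (the names below are only illustrative):

  $ pbsnodes -a
  cherax-0
       state = free
       ...
  cherax-1
       ...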

Gareth

