[torquedev] First Torque impressions on Altix

Chris Samuel csamuel at vpac.org
Wed Dec 24 04:21:39 MST 2008

----- "Michel Béland" <michel.beland at rqchp.qc.ca> wrote:

> Hello,

Hiya Michel,

I'll skip over the bits that Glen commented on previously..

> We installed at our site version 2.3.5 of Torque, compiled with
> --enable-cpuset on an Altix 3300 with four processors and 8 GB.
> The machine has two nodes with two processors each. The memory
> is also split in two between these nodes.

Just to be clear, I'm presuming that here you mean NUMA nodes
and not compute nodes ?

Just "node" usually means the second, but I suspect you mean
the first from our previous conversations. :-)

> - Firstly, the server would do a segmentation fault. 

Yuck, if I get the chance over the holiday I'll try to
reproduce that and then see if I can work out how to fix
it (not being a programmer, I can't promise anything!).

> - Thirdly, cpusets contain only cpu 0 when they are
> launched with -lncpus instead of -lnodes.

Oh that's ugly!  I wonder if that's a symptom of a more
general issue ?

Just doing a test interactive job here:

qsub -q sque -l ncpus=2 -I

I see what you see with the cpuset, but if I do a checkjob
on it (we use Moab) I see:

Req[0]  TaskCount: 1  Partition: production
Memory >= 4000M  Disk >= 0  Swap >= 0
Opsys:   ---  Arch: ---  Features: ---
Dedicated Resources Per Task: PROCS: 2  MEM: 4000M
NodeCount:  1

Allocated Nodes:

and qstat confirms that:

    exec_host = tango068/2

so that implies that just a single CPU was allocated.

Checknode confirms that..

  677992x1  Job:Running  -00:00:10 -> 00:09:50 (00:10:00)

This is, umm, sub-optimal.. :-(

However, if I specify nodes=1:ppn=2 I get the expected behaviour.

I'm not sure if this is a Torque bug or a Moab bug!

> As -lncpus is intended for shared memory machines,
> I think that it ought to work correctly.


> With older versions of PBS Pro, probably more similar to
> Torque then than it is today, using -lnodes on Altix did
> not work quite well: memory requests were ignored. I do not
> know if this problem would appear with Torque, though.

Were they ignored by PBS Pro or were they getting set
and the kernel was ignoring them ?

One of the things on my todo list is to get pbs_mom to
cope with the change that happened around malloc() in
glibc where it went from always using brk() to using
brk() for small allocations below a certain value
(usually 128KB I think) and mmap() for larger allocations.

The issue here is that mem and pmem limits are currently
set using the RLIMIT_DATA resource which is honoured by
brk() but not mmap(). Changing that to be RLIMIT_AS
should fix it (it's honoured by both) but I need to test
to see if it has more "interesting" implications..
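
A rough way to see the effect from user space is sketched below in
Python (this is not pbs_mom code). The helper name
allocation_fails_under_limit and the sizes are made up for
illustration, and it assumes 64-bit Linux with glibc, where an
allocation this large would be served by mmap(). A forked child
lowers its soft RLIMIT_AS and then attempts an allocation bigger
than the limit:

```python
import os
import resource


def allocation_fails_under_limit(limit_bytes, alloc_bytes):
    """Fork a child, cap its address space, and report whether a
    large allocation is refused (hypothetical helper)."""
    pid = os.fork()
    if pid == 0:
        # Child: 0 = allocation worked, 1 = refused, 2 = other error.
        code = 2
        try:
            soft, hard = resource.getrlimit(resource.RLIMIT_AS)
            if hard != resource.RLIM_INFINITY:
                # Never ask for more than the existing hard limit.
                limit_bytes = min(limit_bytes, hard)
            # Only the soft limit is changed; the hard limit is kept.
            resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))
            try:
                buf = bytearray(alloc_bytes)
                code = 0
            except MemoryError:
                code = 1
        except BaseException:
            code = 2
        finally:
            # Leave the child without running the parent's cleanup.
            os._exit(code)
    # Parent: wait for the child and decode its exit status.
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 1
```

With a 1 GiB limit, a 2 GiB request should be refused in the child,
which is the behaviour one would want from the mem and pmem limits.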

> - Fourthly, when I submit a sequential job followed
> by a 2-cpu job, the first job gets a cpuset with
> cpu 0 and the second a cpuset with cpus 1 and 2. This
> is pretty annoying: the second job should get cpus 2 and
> 3 so that they are on the same node.

The scheduler doesn't usually spread jobs across nodes if
it doesn't have to, so I suspect here you mean this is
across NUMA nodes.

> In fact, if cpus 0 and 2 were busy, I would expect the job
> to remain queued.

Hmm, I think that has to be a local site policy decision.

> I realize that this means good knowledge of the cpusets
> by the scheduler.

It's more that it needs good knowledge of the NUMA layout of
the system so it can create better allocations for cpusets.

Currently pbs_mom just uses the vnode numbers allocated
by the scheduler for the core numbers to put into the
cpuset, so either we get the scheduler to know about NUMA
or we do some mapping in the pbs_mom to a better layout
(if available).

I think the second is simplest short term, but not
as flexible.  The former will need a lot of careful
negotiation to figure out how to do that without
breaking things.
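
To make the mapping idea concrete, here is a minimal sketch in
Python (not Torque code). The function pick_cpus, its parameters
and the layout dictionary are all hypothetical. It assumes the
mom knows which cores belong to which NUMA node and which cores
are already in use, and it prefers to keep a job inside one NUMA
node. Whether a job may spill across NUMA nodes is left as a
switch, since that is a site policy decision:

```python
def pick_cpus(numa_layout, busy, ncpus, allow_spill=True):
    """Choose ncpus free cores, preferring a single NUMA node.

    numa_layout: dict mapping NUMA node number -> list of core numbers
    busy:        set of core numbers already in some cpuset
    Returns a list of cores, or None if the request cannot be met.
    """
    free_by_node = {
        node: [c for c in cores if c not in busy]
        for node, cores in numa_layout.items()
    }
    # Best fit: the node with the fewest free cores that still fits.
    fitting = [
        (len(free), node)
        for node, free in free_by_node.items()
        if len(free) >= ncpus
    ]
    if fitting:
        _, node = min(fitting)
        return free_by_node[node][:ncpus]
    if not allow_spill:
        # Site policy: leave the job queued rather than split it.
        return None
    # Otherwise take free cores across nodes, in node order.
    spill = [c for node in sorted(free_by_node) for c in free_by_node[node]]
    return spill[:ncpus] if len(spill) >= ncpus else None


# The Altix described above: two NUMA nodes with two cores each.
altix = {0: [0, 1], 1: [2, 3]}
```

With core 0 busy, pick_cpus(altix, {0}, 2) picks cores 2 and 3
rather than 1 and 2. With cores 0 and 2 busy, a 2-cpu request
either gets cores 1 and 3 or, with allow_spill=False, gets None
so the job can stay queued.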

> As I want to make this work with pbs_sched or Maui because
> of budget constraints, the best way to make this work for me
> is to make sure that all the jobs use one or more complete nodes.

Do you mean a complete compute node or a complete NUMA node ?

If NUMA I don't think there's any way defined to request
them in the PBS spec..

> - Fifthly, the cpusets contain all the nodes for memory,
> instead of just the nodes needed according to the memory
> request.

Correct, that's something that I want to be able to solve
with the NUMA support.
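
As a sketch of what that could look like (Python, purely
illustrative; pick_mems and its arguments are hypothetical names,
and it ignores memory already used by other jobs), the memory
nodes for a cpuset could start with the nodes local to the job's
cores and only grow when the memory request does not fit:

```python
def pick_mems(node_mem_mb, local_nodes, mem_request_mb):
    """Choose NUMA memory nodes for a job's cpuset.

    node_mem_mb:    dict mapping NUMA node number -> memory size in MB
    local_nodes:    NUMA nodes that hold the job's allocated cores
    mem_request_mb: the job's memory request in MB
    """
    # Start with the memory that is local to the job's cores.
    mems = sorted(set(local_nodes))
    total = sum(node_mem_mb[n] for n in mems)
    # Add remote nodes, in node order, only until the request fits.
    for node in sorted(node_mem_mb):
        if total >= mem_request_mb:
            break
        if node not in mems:
            mems.append(node)
            total += node_mem_mb[node]
    return mems


# 8 GB split evenly between the two NUMA nodes, as on the Altix above.
altix_mem = {0: 4096, 1: 4096}
```

A job on NUMA node 1 asking for 2000 MB would get only node 1
for memory, while a 6000 MB request would get nodes 1 and 0.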

> I guess that I can probably easily change Torque to
> restrict the memory, provided that I use the dummy
> qsub script described above.

Well you can restrict the memory with ulimits, but
that won't control which NUMA node it ends up being
allocated on..

Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency
