[torquedev] Re: [torqueusers] Torque notes from SC'08

Chris Samuel csamuel at vpac.org
Wed Dec 3 23:34:07 MST 2008

----- "Michel Béland" <michel.beland at rqchp.qc.ca> wrote:

Hello Michel,

It was good to meet you at SC!

> Here is what I get on our Altix 4700:
> Again, here is what I get on our Altix 4700 (one 256-processor
> partition, in fact) for node 0:

Thanks for that, much appreciated.

> What is peculiar is that it does not always increase, due to the
> funny topology of the level-0 routers, that I described above.

Yeah, these things are, umm, "interesting".. :-)

> Considering the way it behaves on my system, I think listing these
> latencies out is better. But anyway, why send this to the scheduler?

I was pondering making such decisions in pbs_mom, but
people were asking about being able to make policy decisions
on such things, and the place for that is in the scheduler.

Also, on smaller NUMA systems (such as our dual socket Opterons)
the scheduler has to be more informed about these things as it
needs to decide which node to run a job on; and if it doesn't have
visibility of the amount of RAM on each NUMA node on the system
(or even how many NUMA nodes there are) then it may not make the
best decision on where to place jobs.

Consider a tiny cluster with two 8 core nodes, each with 16GB of
RAM, both already running a 6 core job that uses 12GB of RAM. At
present the scheduler can't know that the first is a quad socket
(4 NUMA node) dual core system and the second is a dual socket
(2 NUMA node) system, so it can't tell that it would be better off
putting a 2 CPU, 2GB RAM job onto the first, because there is no
contention on its memory there.
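
A rough sketch of that example, in Python purely for illustration.
None of these names are Torque or scheduler APIs; the classes and
functions are made up here, memory is assumed to be split evenly
across the NUMA nodes, and the existing jobs are assumed to fill
NUMA nodes in order. It shows that the per-host totals are identical
for the two machines while the per-NUMA-node view is not:

```python
from dataclasses import dataclass


@dataclass
class NumaNode:
    cores: int
    mem_gb: int
    used_cores: int = 0
    used_mem_gb: int = 0

    def is_idle(self):
        return self.used_cores == 0 and self.used_mem_gb == 0


def make_host(numa_nodes, cores_each, mem_each_gb):
    # Assumption: cores and memory are spread evenly over the NUMA nodes.
    return [NumaNode(cores_each, mem_each_gb) for _ in range(numa_nodes)]


def place(host, cores, mem_gb):
    """Greedily fill NUMA nodes in order (an assumed policy)."""
    for node in host:
        take_cores = min(cores, node.cores - node.used_cores)
        take_mem = min(mem_gb, node.mem_gb - node.used_mem_gb)
        node.used_cores += take_cores
        node.used_mem_gb += take_mem
        cores -= take_cores
        mem_gb -= take_mem


def free_totals(host):
    """Per-host totals only: (free cores, free GB)."""
    return (sum(n.cores - n.used_cores for n in host),
            sum(n.mem_gb - n.used_mem_gb for n in host))


def has_idle_numa_node_for(host, cores, mem_gb):
    """True if a completely unused NUMA node can hold the whole job."""
    return any(n.is_idle() and n.cores >= cores and n.mem_gb >= mem_gb
               for n in host)


hosts = {
    "quad_socket": make_host(numa_nodes=4, cores_each=2, mem_each_gb=4),
    "dual_socket": make_host(numa_nodes=2, cores_each=4, mem_each_gb=8),
}
for host in hosts.values():
    place(host, cores=6, mem_gb=12)   # the 6 core, 12GB job on each

totals = {name: free_totals(h) for name, h in hosts.items()}
preferred = [name for name, h in hosts.items()
             if has_idle_numa_node_for(h, cores=2, mem_gb=2)]
```

Both entries in `totals` come out the same, so a scheduler that only
sees totals has nothing to choose between them; only the quad socket
host ends up in `preferred`.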

*However*, as an initial implementation, just doing it in
pbs_mom would be a nice quick method, as that would permit it to go
ahead without any modifications to any of the schedulers (which
require a lot of coordination).

> It is much easier to just let pbs_mom pick the nodes to be used by the
> cpuset, according to the memory and cpu requirements of the job.

That makes sense on a large Altix, but on most clusters I think
the scheduler does need such information.

> PBS Pro does use the scheduler for that, but it does it by virtualizing
> the Altix nodes so that they are treated like machines by themselves. I
> am not sure that Torque is ready for that yet.

I'm not sure that's necessary if pbs_mom returns information
about the layout and the scheduler can specify the layout of a
job via exec_host.

This is what Dave Singleton's fork of PBS is doing already (at
least in terms of the number of NUMA nodes and memory being
used, the distances are hardwired into the scheduler from what
he's told me).

> > Much more friendly to pbs_server, though not as artistic as
> > Garrick's idea of drawing the layout of the system in ASCII
> > art and letting the scheduler folks work it out from that. :-)
> Here are the concatenated distance files for a 16-processor Altix
> 350, which does not have any router:
> 10 21 21 27 27 33 33 39
> 21 10 51 21 45 27 39 33
> 21 51 10 45 21 39 27 33
> 27 21 45 10 39 21 33 27
> 27 45 21 39 10 33 21 27
> 33 27 39 21 33 10 27 21
> 33 39 27 33 21 27 10 21

Hmm, did you miss node 7 out of that list?  It's only got
rows for nodes 0 - 6 from the look of it..
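
For what it's worth, this is the sort of sanity check I mean, as a
small Python sketch over the figures quoted above (the function
names are just made up for the example):

```python
QUOTED = """\
10 21 21 27 27 33 33 39
21 10 51 21 45 27 39 33
21 51 10 45 21 39 27 33
27 21 45 10 39 21 33 27
27 45 21 39 10 33 21 27
33 27 39 21 33 10 27 21
33 39 27 33 21 27 10 21
"""


def parse_matrix(text):
    """One row of integers per non-empty line."""
    return [[int(x) for x in line.split()]
            for line in text.splitlines() if line.strip()]


def missing_rows(rows):
    """Node numbers that appear as columns but have no row of their own."""
    width = max(len(row) for row in rows)
    return list(range(len(rows), width))


rows = parse_matrix(QUOTED)
missing = missing_rows(rows)
```

Each row has 8 columns but there are only 7 rows, so `missing` ends
up holding just node 7.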

> You can see that the first line is not enough for this machine...

Oh I wasn't meaning to imply it was, as I talked about the fact
that we need to return the matrix of all those distances to the
pbs_server for each compute node.
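
As a minimal sketch of gathering that matrix on a Linux compute
node, assuming the kernel exposes the per-node files
/sys/devices/system/node/nodeN/distance (the function name is
invented for this example, and how pbs_mom would actually pass the
result to pbs_server is not shown):

```python
from pathlib import Path


def read_distance_matrix(base="/sys/devices/system/node"):
    """Return {node_id: [distance to node 0, node 1, ...]} from sysfs."""
    matrix = {}
    for node_dir in Path(base).glob("node[0-9]*"):
        distance_file = node_dir / "distance"
        if not distance_file.is_file():
            continue  # no distance information for this node
        node_id = int(node_dir.name[len("node"):])
        matrix[node_id] = [int(x) for x in distance_file.read_text().split()]
    # Sort by node number so rows come out in a predictable order.
    return dict(sorted(matrix.items()))
```

If no distance files are found the result is simply empty, and the
caller has to decide what to assume instead.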

> I wonder if something is broken in there, though. Here is the topology
> of the machine:
> 4 - 6 - 7
> |       |
> 2       5
> |       |
> 0 - 1 - 3
> You can see that the latency from node 0 to 1 or 2 is 21, but it is
> 51 between nodes 1 and 2, as if it went through 6 connections instead
> of 2 (going almost all the way around the circle, or should I say
> square).

It looks like it's 51 between nodes 2 and 3, which probably
makes sense because adding the latencies of the shortest path
from 2->0, 0->1 and then 1->3 (each of which is 21) gives you
63, and the value of 21 per hop will include costs that a
transaction going directly between 2->3 probably won't incur.
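
To spell that arithmetic out, here is a small Python sketch. The
edge list is my reading of the topology diagram quoted above, and
the 21 per hop is the single-hop figure from the distance table;
the names are made up for the example:

```python
from collections import deque

# Links as drawn in the quoted diagram (assumed reading of the ASCII art).
EDGES = [(4, 6), (6, 7), (4, 2), (7, 5), (2, 0), (5, 3), (0, 1), (1, 3)]
PER_HOP = 21  # reported distance between directly connected nodes


def shortest_path(edges, src, dst):
    """Breadth-first search; returns the list of nodes from src to dst."""
    adjacent = {}
    for a, b in edges:
        adjacent.setdefault(a, []).append(b)
        adjacent.setdefault(b, []).append(a)
    previous = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in adjacent.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    if dst not in previous:
        return None  # not connected
    path = []
    node = dst
    while node is not None:
        path.append(node)
        node = previous[node]
    return path[::-1]


path = shortest_path(EDGES, 2, 3)
hops = len(path) - 1
naive_cost = hops * PER_HOP
```

That gives the path 2, 0, 1, 3: three hops, and a naive sum of 63,
which is the upper figure to compare the reported 51 against.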

> Shown above for our Altix 4700. Each node has two sockets and 4 cores
> (we have dual-core Montecito processors).


> > If the system isn't running a kernel new enough to have that
> > information we might just have to assume that all sockets are
> > equi-distant.
> I am not sure that all this information is really needed, at
> least in a first step.

Possibly, and it might be a useful starting point, especially
if it's modularised so when later on the scheduler tells us
what to pick it's easy to drop the old logic and include the
new without needing to touch the functions to add and remove
mems from a cpuset.

> This might be nice to know the best nodes to pick for a job.

Indeed, which is precisely our problem here. :-)

> But the really important thing to me is that jobs should be given in
> their cpuset whole nodes for their cpu and memory requirements.
> The scheduler needs then to know about this so that it does
> not try to give some part of these nodes to other jobs.

I think that has to be a local policy decision, we certainly
couldn't do that on our Opteron cluster with 4 cores per NUMA
node as we have a huge user population to service.

This is another reason for it to go in the scheduler.

> The original memory and cpu requirements should always be kept
> in case the job needs to be restarted.

The nodes and mem will be. Do you mean the vnode and NUMA node
allocations too?

> Altair did something like this with PBS Pro in some previous version,
> but they reverted back to the original exec_host syntax. They added
> instead a new string called exec_vnode that looked like this:

I think I'd rather not look at what Altair have done,
just in case.. :-(

Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency
