[torquedev] Re: [torqueusers] Torque notes from SC'08

Michel Béland michel.beland at rqchp.qc.ca
Thu Dec 4 09:20:07 MST 2008


Hello Chris,

I have dropped torqueusers from the recipients here.

> Hello Michel,
> 
> It was good to meet you at SC!

It was good to meet you too.

>> What is peculiar is that it does not always increase, due to the
>> funny topology of the level-0 routers, that I described above.
> 
> Yeah, these things are, umm, "interesting".. :-)
> 
>> Considering the way it behaves on my system, I think listing these
>> latencies out is better. But anyway, why send this to the scheduler?
> 
> I was pondering about making such decisions in pbs_mom, but
> people were asking about being able to make policy decisions
> on such things, and the place for that is in the scheduler.

OK.

> Also, on smaller NUMA systems (such as our dual socket Opterons)
> the scheduler has to be more informed about these things as it
> needs to decide which node to run a job on; and if it doesn't have
> visibility of the amount of RAM on each NUMA node on the system
> (or even how many NUMA nodes there are) then it may not make the
> best decision on where to place jobs.

> Consider a tiny cluster with two 8 core nodes and 16GB RAM, the
> scheduler can't know at present that if they've both got a 6 core
> job using 12 GB of RAM that the first is a quad socket (4 NUMA node)
> dual core system and the second is a dual socket (2 NUMA node)
> system and so it would be better off to put a 2 CPU 2GB RAM job
> onto the first because there is no contention on its memory
> controller.

If you go this route, you should provide a way for the administrator
to define the topology manually, in case Torque gets it wrong.

>> PBS Pro does use the scheduler for that, but it does it by virtualizing
>> the Altix nodes so that they are treated like machines by themselves. I
>> am not sure that Torque is ready for that yet.
> 
> I'm not sure that's necessary if pbs_mom returns information
> about the layout and the scheduler can specify the layout of a
> job via exec_host.

> This is what Dave Singleton's fork of PBS is doing already (at
> least in terms of the number of NUMA nodes and memory being
> used, the distances are hardwired into the scheduler from what
> he's told me).

OK, that is one way to do it. But again, I advise against changing
exec_host in Torque; add a new string instead, to keep backward
compatibility with other programs that read exec_host (mpiexec, for
example). Virtualizing nodes has other advantages, such as being able
to put a virtual node offline, but that is another story.
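
As an illustration of why I would rather not touch it: consumers of
exec_host generally just split it on "+" and "/", along the lines of
the little sketch below (the host names are made up for the example).
Adding a separate attribute alongside exec_host leaves such parsers
alone.

  # Minimal sketch of how a consumer such as mpiexec typically reads
  # exec_host, assuming the usual "hostA/0+hostA/1+hostB/0" form.
  # The host names are made up for the example.
  def parse_exec_host(exec_host):
      hosts = {}
      for entry in exec_host.split("+"):
          host, slot = entry.split("/")
          hosts.setdefault(host, []).append(int(slot))
      return hosts

  print(parse_exec_host("altix1/0+altix1/1+altix2/0"))
  # -> {'altix1': [0, 1], 'altix2': [0]}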

>> Here are the concatenated distance files for a 16-processor Altix
>> 350, which does not have any router:
>>
>> 10 21 21 27 27 33 33 39
>> 21 10 51 21 45 27 39 33
>> 21 51 10 45 21 39 27 33
>> 27 21 45 10 39 21 33 27
>> 27 45 21 39 10 33 21 27
>> 33 27 39 21 33 10 27 21
>> 33 39 27 33 21 27 10 21
> 
> Hmm, did you miss node 7 out of that list ?  It's only got
> nodes 0 - 6 from the look of it..

You are right; here is the complete output. I must have made a mistake
when cutting and pasting.

10 21 21 27 27 33 33 39
21 10 51 21 45 27 39 33
21 51 10 45 21 39 27 33
27 21 45 10 39 21 33 27
27 45 21 39 10 33 21 27
33 27 39 21 33 10 27 21
33 39 27 33 21 27 10 21
39 33 33 27 27 21 21 10
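
For what it is worth, these matrices are simply the per-node distance
files under sysfs, concatenated. A minimal sketch that collects them
into one table (assuming the standard /sys/devices/system/node layout):

  # Print the NUMA distance matrix by reading the per-node distance
  # files under /sys/devices/system/node.
  import glob

  def read_distance_matrix(base="/sys/devices/system/node"):
      paths = glob.glob(base + "/node*/distance")
      # sort node0, node1, ... numerically rather than lexically
      paths.sort(key=lambda p: int(p.split("node")[-1].split("/")[0]))
      rows = []
      for path in paths:
          with open(path) as f:
              rows.append([int(x) for x in f.read().split()])
      return rows

  for row in read_distance_matrix():
      print(" ".join("%2d" % d for d in row))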


>> You can see that the first line is not enough for this machine...
> 
> Oh I wasn't meaning to imply it was, as I talked about the fact
> that we need to return the matrix of all those distances to the
> pbs_server for each compute node.
> 
>> I wonder if something is broken in there, though. Here is the topology
>> of the machine:
>>
>> 4 - 6 - 7
>> |       |
>> 2       5
>> |       |
>> 0 - 1 - 3
>>
>> You can see that the latency from node 0 to 1 or 2 is 21, but it is
>> 51 between nodes 1 and 2, as if it went through 6 connections instead
>> of 2 (going almost all the way around the circle, or should I say
>> square).
> 
> It looks like it's 51 between nodes 2 and 3, which probably
> makes sense because adding the latencies of the shortest path
> from 2->0, 0->1 and then 1->3 (each of which are 21) gives you
> 63, and the value of 21 per hop will include costs that the
> transaction directly between 2->3 probably won't incur.

The value of 51 is between nodes 1 and 2, not 2 and 3 (which is 45). I
expected to get 27 instead of 51, since nodes 1 and 2 are only two hops
apart (through node 0). The latency increases like this on the Altix 350:

hops   latency
0      10
1      21
2      27
3      33
4      39
5      45
6      51

So there is a large penalty when you first leave a node (21 instead of
10), but after that each additional hop only adds 6.
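
In other words, the distances follow a simple linear rule once you are
off the local node. A small sketch (assuming the measured values above
are representative of the whole machine) that recovers the hop count
from a distance value:

  # On this Altix 350: distance 10 means the local node, the first hop
  # costs 21, and every additional hop adds 6, i.e.
  # distance = 15 + 6*hops for hops >= 1 (assuming the values above
  # hold across the machine).
  def hops_from_distance(distance):
      if distance == 10:
          return 0
      return (distance - 15) // 6

  for d in (10, 21, 27, 33, 39, 45, 51):
      print(d, hops_from_distance(d))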

>> This might be nice to know the best nodes to pick for a job.
> 
> Indeed, which is precisely our problem here. :-)

I think that I have missed part of the original discussions about
cpusets and their shortcomings in Torque.

>> But the really important thing to me is that jobs should be given in
>> their cpuset whole nodes for their cpu and memory requirements.
>> The scheduler needs then to know about this so that it does
>> not try to give some part of these nodes to other jobs.
> 
> I think that has to be a local policy decision, we certainly
> couldn't do that on our Opteron cluster with 4 cores per NUMA
> node as we have a huge user population to service.
> 
> This is another reason for it to go in the scheduler.

OK, I understand that this does not fit all needs; it should be
configurable. In fact, it should be configurable from one job to the
next (two small jobs could share a node, but large parallel jobs should
not).

>> The original memory and cpu requirements should always be kept
>> in case the job needs to be restarted.
> 
> The nodes and mem will be, do you mean the vnode and NUMA node
> allocations too ?

No, not the node allocations. If one asks for 4 cpus and 20 GB on our
Altix 4700, the job will get three nodes (12 cpus and a little less than
23 GB, because some memory has to be given to the operating system). One
way to achieve this in Torque might be to increase the cpu and memory
requirements so that complete nodes are requested. I think the Moab
documentation advises doing this in a qsub replacement script (one that
calls the real qsub) when it is used with PBS Pro. But how can this be
done in a script when the nodes do not all have exactly the same amount
of memory? The /sys/devices/system/node/node*/meminfo files show small
variations. If it is done by the scheduler when the job is scheduled to
run, it can fill the selected nodes all right, but if the job is
restarted for some reason, it might run on nodes with slightly less
memory, forcing the scheduler to request another node for the job even
though it is not really needed.

We have seen this happen with an old version of PBS Pro on our Altix
machines.
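
For what it is worth, the rounding I have in mind could look roughly
like the sketch below. It is only an illustration, not working code for
any particular machine: the figure of 4 cpus per node is an assumption,
and a real wrapper would pass the adjusted request on to the real qsub.
The point is to take the smallest node as the reference, so that a
restarted job still fits on nodes with slightly less memory.

  # Rough sketch of the rounding a qsub replacement script could do
  # before calling the real qsub.  The cpus-per-node value is an
  # illustrative assumption, not a description of any particular Altix.
  import glob

  CPUS_PER_NODE = 4   # assumed: two dual-core sockets per node

  def usable_mem_per_node_kb():
      # Use the *smallest* node as the reference, so that a restarted
      # job still fits if it lands on nodes with slightly less memory.
      sizes = []
      for path in glob.glob("/sys/devices/system/node/node*/meminfo"):
          with open(path) as f:
              for line in f:
                  if "MemTotal:" in line:
                      sizes.append(int(line.split()[-2]))  # value in kB
      return min(sizes)

  def nodes_needed(ncpus, mem_kb):
      mem_per_node = usable_mem_per_node_kb()
      by_cpu = -(-ncpus // CPUS_PER_NODE)   # ceiling division
      by_mem = -(-mem_kb // mem_per_node)
      return max(by_cpu, by_mem)

  # Example: round a request for 4 cpus and 20 GB up to whole nodes.
  n = nodes_needed(4, 20 * 1024 * 1024)
  print("request %d complete nodes (%d cpus)" % (n, n * CPUS_PER_NODE))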


-- 
Michel Béland, analyste en calcul scientifique
michel.beland at rqchp.qc.ca
bureau S-250, pavillon Roger-Gaudry (principal), Université de Montréal
téléphone   : 514 343-6111 poste 3892     télécopieur : 514 343-2155
RQCHP (Réseau québécois de calcul de haute performance)  www.rqchp.qc.ca

