[torquedev] Re: [torqueusers] Torque notes from SC'08

Chris Samuel csamuel at vpac.org
Wed Dec 24 03:23:23 MST 2008

----- "Michel Béland" <michel.beland at rqchp.qc.ca> wrote:

> Hello Chris,


Apologies for taking so long to reply; I've been working
on this draft since the 8th of December but have been
flat out with odd hardware issues! :-(

> If you go this route, you should have a way to let the
> administrator define its own topology, in case Torque
> gets it wrong.

...or the topology information is simply not available,
e.g. with an old kernel.

Makes sense.
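A minimal sketch of what such an override might look like, assuming a hypothetical plain-text file of `node_id: cpu,cpu,...` lines (the path and format are made up for illustration, not existing Torque syntax), falling back to sysfs when the file is absent:

```python
# Sketch only: a hypothetical admin-defined topology override for the
# case where Torque misdetects the layout or the kernel is too old to
# expose it.  The override path and "node: cpu,cpu,..." format are
# assumptions, not real Torque configuration syntax.
import os

def parse_admin_topology(text):
    """Parse lines like '0: 0,1,2,3' into {numa_node: [cpu, ...]}."""
    topo = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        node, cpus = line.split(":", 1)
        topo[int(node)] = [int(c) for c in cpus.split(",")]
    return topo

def detect_topology(override="/var/spool/torque/mom_priv/topology"):
    """Prefer the admin's file; otherwise read cpu ranges from sysfs."""
    if os.path.exists(override):
        with open(override) as f:
            return parse_admin_topology(f.read())
    topo = {}
    base = "/sys/devices/system/node"
    for entry in sorted(os.listdir(base)):
        if entry.startswith("node") and entry[4:].isdigit():
            # cpulist holds a range string such as "0-3"
            with open(os.path.join(base, entry, "cpulist")) as f:
                topo[int(entry[4:])] = f.read().strip()
    return topo
```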

> OK, that is one way to do it. But again I advise not to change
> exec_host in Torque, but add a new string instead, to keep it
> backward compatible with other programs that read exec_host
> (like mpiexec, for example).

A very interesting point, and one I'd not realised!

That makes it look like it's really necessary not to
fiddle with exec_host. :-(

> Virtualizing nodes has other advantages, like being able to put a
> virtual node offline, for example, but that is another story.


> The value of 51 is between nodes 1 and 2, not 2 and 3 (which is 45).

Oops, that was a count-from-one error on my part, sorry! :-(

> So you have a large penalty when you first get out of a node, but
> then it does not increase as much.

Yup, that's what Dave S. was saying to me; my analogy was
an energy barrier. :-)
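For reference, those distances can be read straight out of sysfs on a reasonably recent kernel; a sketch of parsing them and seeing that "barrier" (the sample matrix keeps the 51 and 45 values from above, but the third value is illustrative, not Michel's actual machine):

```python
# Sketch: parse /sys/devices/system/node/node*/distance, whose rows are
# space-separated SLIT values (10 = local).  The jump from the local
# value to the nearest remote one is the "energy barrier" above.

def parse_node_distances(text):
    """One row per NUMA node, e.g. the concatenated node*/distance files."""
    return [[int(x) for x in line.split()]
            for line in text.strip().splitlines()]

def hop_costs(row):
    """Sorted unique distances seen from one node: a big first jump
    (local -> nearest remote), then smaller increases after that."""
    return sorted(set(row))

# Illustrative 3-node matrix: 51 between nodes 1 and 2, 45 between
# nodes 2 and 3 (as in the thread); 62 is an invented filler value.
sample = "10 51 62\n51 10 45\n62 45 10"
```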

> >> This might be nice to know the best nodes to pick for a job.
> > 
> > Indeed, which is precisely our problem here. :-)
> I think that I have missed part of the original discussions
> about cpusets and their shortcomings in Torque.

We might be confusing the issue by using "nodes" to mean
both NUMA nodes and compute nodes; I meant compute nodes
in that sense.

E.g. if you've got two dual-socket, quad-core compute nodes,
each with 4 cores free, you want to pick the one with a whole
socket free over the one with 1 core free on one socket and
3 on the other.
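That preference can be expressed as a tiny scoring rule. A sketch, assuming a `{hostname: {socket_id: free_cores}}` view of the cluster (nothing Torque actually tracks in this form):

```python
# Sketch: prefer the compute node whose free cores span the fewest
# sockets, so a 4-core job lands on a wholly-free socket where possible.
# The {hostname: {socket_id: free_cores}} layout is an assumption.

def sockets_needed(free_by_socket, need):
    """How many sockets a request touches, filling the socket with the
    most free cores first.  None if the node can't satisfy it at all."""
    used = 0
    for free in sorted(free_by_socket.values(), reverse=True):
        if need <= 0:
            break
        need -= free
        used += 1
    return used if need <= 0 else None

def pick_node(candidates, need):
    """Choose the candidate that spreads the job over fewest sockets."""
    feasible = {host: sockets_needed(sockets, need)
                for host, sockets in candidates.items()}
    feasible = {h: n for h, n in feasible.items() if n is not None}
    return min(feasible, key=feasible.get) if feasible else None
```

With nodeA having a whole quad-core socket free and nodeB split 1+3, a 4-core job goes to nodeA.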

> >> The original memory and cpu requirements should always be kept
> >> in case the job needs to be restarted.
> > 
> > The nodes and mem will be, do you mean the vnode and NUMA node
> > allocations too ?
> No, not the node allocations. If one asks for 4 cpus and 20 GB
> on our Altix 4700, the job will get three nodes (12 cpus and a
> little less than 23 GB, because some memory has to be given to
> the operating system).

Now at that point you're talking NUMA nodes and not compute
nodes, yes ?

Does the job get to access all 12 of those CPUs, or are they
just marked as inaccessible to other jobs ?
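Michel's sizing arithmetic can be sketched as rounding each resource up to whole NUMA nodes and taking the larger; the per-node figures below (4 cpus, just under 8 GB usable) are inferred from his 12-cpu / ~23 GB example, not official Altix 4700 numbers:

```python
# Sketch of the allocation arithmetic Michel describes: a request is
# rounded up to whole NUMA nodes on both the cpu axis and the memory
# axis, and the larger of the two wins.  Per-node capacities are
# inferred from the example (3 nodes = 12 cpus, a little under 23 GB).
import math

CPUS_PER_NODE = 4
USABLE_GB_PER_NODE = 23.0 / 3   # ~7.67 GB after the OS takes its share

def numa_nodes_required(ncpus, mem_gb):
    by_cpu = math.ceil(ncpus / CPUS_PER_NODE)
    by_mem = math.ceil(mem_gb / USABLE_GB_PER_NODE)
    return max(by_cpu, by_mem)
```

So 4 cpus + 20 GB needs max(1, 3) = 3 NUMA nodes, i.e. 12 cpus, matching the example.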

> One way to achieve this on Torque might be to increase the cpu
> and memory requirements to request complete nodes.

But if you had (say) an MPI job partitioned for N cores
and you were using a launcher that was TM aware then the
user might not be happy to find that his careful work has
been thwarted.

That might just be a user education issue though ("well
scale it up to X cores instead then").

> The /sys/devices/system/node/node*/meminfo files show small
> variations. If it is done by the scheduler when the job is
> scheduled to run, it can fill the selected nodes all right,
> but if the job is restarted for some reason, it might run on
> nodes with slightly less memory, forcing the scheduler to
> request another node for the job while it is not really needed.

At the moment with Torque that might be academic, as the BLCR
checkpoint/restart work scheduled for 2.4 doesn't support
parallel jobs (you need support in the MPI stacks, for instance).

But in the general case, yes, I can see that happening and I
can't really see a way around it; the system is very unlikely
to be close to the state it was in when the job was suspended.
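For what it's worth, the per-node totals Michel is comparing come from lines like `Node 0 MemTotal: 8252332 kB` in those meminfo files; a sketch of pulling them out and measuring the spread:

```python
# Sketch: extract "Node N MemTotal: X kB" from each
# /sys/devices/system/node/node*/meminfo file and report the
# node-to-node spread that can trip up a scheduler restoring
# a checkpointed job onto slightly smaller nodes.

def mem_total_kb(meminfo_text):
    """Return the MemTotal value (in kB) from one node's meminfo."""
    for line in meminfo_text.splitlines():
        if "MemTotal:" in line:
            return int(line.split()[-2])   # value sits before the "kB"
    return None

def spread_kb(totals):
    """The small variation between nodes that Michel mentions."""
    return max(totals) - min(totals)
```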

> We have seen this happen with an old version of PBS Pro on
> our Altix machines.

It might be a necessary price to pay for C/R or S/R
with jobs on these large systems. :-(

All the best!
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency
