[torqueusers] Job Allocation on Nodes
Gareth.Williams at csiro.au
Thu Mar 8 13:59:31 MST 2012
> -----Original Message-----
> From: Bill Wichser [mailto:bill at Princeton.EDU]
> Sent: Thursday, 8 March 2012 11:48 AM
> To: Torque Users Mailing List
> Cc: Williams, Gareth (CSIRO IM&T, Docklands)
> Subject: Re: [torqueusers] Job Allocation on Nodes
> On 3/7/2012 7:30 PM, Gareth.Williams at csiro.au wrote:
> >> Perhaps this question has been answered before. I have users who
> >> want to distribute jobs equally amongst nodes. What I am observing at
> >> the moment is that when a user submits a job with nodes=12:ppn=3, the
> >> job uses three nodes with 12 cores per node. Is there a way to make
> >> the job use only three cores per node? How can I prevent this or set up
> >> some kind of affinity for following the user's job requirements?
> > Hi Randall,
> > Why would you want to do such a thing? If the user submits four such
> > jobs they will align, and you will get worse contention. I would
> > suggest: if you need to spread jobs out to access memory, then you
> > should schedule memory; and/or if you need to avoid contention, say for
> > memory bandwidth, then get the users to request whole nodes (all the
> > available ppn) and run only as many processes as their scaling permits
> > (they may need custom mpirun options).
> > Gareth
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> I have users who desire to do just this -- maximize memory bandwidth
> for their application. It turns out that sharing nodes with other jobs
> always provides better memory bandwidth than filling a whole node with
> the one job. This can be reproduced quantitatively just by looking at
> walltime. Sometimes allocating multiple cores to cover memory use is
> required, but the --bynode flag for openmpi is always used.
> So memory contention is overcome, and the node can be shared even with
> the same user's jobs, as this contention tends to run in cycles rather
> than overlapping.
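For what it's worth, the two approaches discussed above could be combined in a single PBS script along these lines (a sketch only -- the node/core counts, walltime, and binary name are placeholders, not anything from the original posts):

```shell
#!/bin/sh
# Request whole nodes, as Gareth suggests: all 12 cores on each of
# 3 nodes, so no other job lands on them.
#PBS -l nodes=3:ppn=12
#PBS -l walltime=01:00:00

cd $PBS_O_WORKDIR

# ...but start only as many ranks as the code scales to, placed
# round-robin one per node with Open MPI's --bynode flag (the flag
# Bill mentions), so each rank sees more of a node's memory bandwidth.
mpirun --bynode -np 6 ./my_app
```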
I think it is worthwhile having such postings on the list. I agree, as a generalization, that we gain efficiency through overlapping demand - after all, that is part of the point of having one big cluster rather than many separate clusters.
However, the key to HPC is allocating (dedicated) resources to jobs, and we have to choose a granularity: how dedicated is dedicated? Dedicating cpus/cores is relatively easy, though it is getting harder with multi-core. I'd suggest that dedicating memory bandwidth (and memory itself) is the next most important. I'm resigned to having to share (bandwidth to) network and storage, though that can be moderated by the layout of multi-process jobs.
If your site gets useful efficiency by spreading jobs to overlap memory bandwidth utilization, then that is a good solution for you. We prefer a more conservative approach where there is less scope for jobs to impact one another. This is a choice that the cluster manager or support team needs to make, and it helps to have this information available to inform such decisions.
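As a concrete illustration of scheduling memory alongside cores (the numbers here are invented for the example): Torque accepts a per-process memory request via the pmem resource, which the scheduler can use to avoid over-committing a node's RAM when jobs share it.

```shell
# Hypothetical example: request 3 cores on each of 12 nodes, plus 2 GB
# per process, so co-located jobs' combined pmem must fit in a node's
# physical memory.
qsub -l nodes=12:ppn=3 -l pmem=2gb job.sh
```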