[torquedev] Re: [torqueusers] Allocating resources by CPUs

Martin Siegert siegert at sfu.ca
Tue Mar 13 18:51:45 MDT 2007


On Tue, Mar 13, 2007 at 09:16:14AM -0600, Garrick Staples wrote:
> On Tue, Mar 13, 2007 at 07:48:26AM -0400, Thomas H Dr Pierce alleged:
> > Dear Torque MLs,
> > 
> > Is this a resource management issue or a scheduling issue? E.g., is it part 
> > of what TORQUE should do, or a policy statement that MAUI should be 
> > doing? I have always been somewhat puzzled and wondered which system 
> > actually selects the specific nodes to use in a job run.
> 
> 
> It's both because pbs_server keeps track of assigned "vnodes".
> 
> pbs_server has its own resource allocator internally.  When you use
> pbs_sched or 'qrun', you are using pbs_server's allocator.  pbs_sched
> merely determines when and which job will run next, leaving it up to
> pbs_server to figure out which nodes to assign.
> 
> maui/moab take a different approach and do the resource scheduling *and*
> node assignment; passing the node list back to pbs_server when starting
> the job.  This is equivalent to using qrun's -H option.
> 
> As far as determining the meaning of "nodes" versus "nodes:ppn", that's
> up to maui.

I know that the interpretation of "nodes" versus "nodes:ppn" is
currently left to maui/moab. But it shouldn't be!
Maui/Moab (or the sysadmin who configured it) cannot read the user's mind.
Most of the sysadmins within WestGrid by now have configured moab with
JOBNODEMATCHPOLICY EXACTNODE
because that's simply what a user expects when submitting a job with
-l nodes=x:ppn=1
However, in that configuration a user can no longer submit an
x-processor job to run "anywhere on the cluster". As a consequence
queue waiting times are substantially longer.
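
To make the two readings concrete, here is a toy sketch in Python. It is
not how maui/moab or pbs_server actually allocate; the function names and
the free-processor table are made up for illustration. It only shows why
the same request, nodes=4:ppn=1, can start immediately under one reading
and has to wait under the other.

```python
# Toy model only: hypothetical helpers, not Maui/Moab or TORQUE code.
# `free` maps a host name to its number of currently free processors.

def place_exact_node(free, nodes, ppn):
    """Strict reading: `nodes` distinct hosts, `ppn` processors on each."""
    hosts = [h for h, n in sorted(free.items()) if n >= ppn]
    if len(hosts) < nodes:
        return None  # job has to wait
    return {h: ppn for h in hosts[:nodes]}

def place_anywhere(free, nodes, ppn):
    """Loose reading: nodes*ppn processors, packed wherever they are free."""
    need = nodes * ppn
    alloc = {}
    for h, n in sorted(free.items()):
        if need == 0:
            break
        take = min(n, need)
        if take > 0:
            alloc[h] = take
            need -= take
    return alloc if need == 0 else None

# One idle 4-way host, two fully busy hosts.
free = {"n01": 4, "n02": 0, "n03": 0}

print(place_exact_node(free, 4, 1))  # strict reading: cannot start
print(place_anywhere(free, 4, 1))    # loose reading: fits on n01
```

With the strict reading the job waits until four different hosts each have
a free processor; with the loose reading it starts right away on n01. The
user has no way to say which of the two is wanted.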

Thus, there is no way to have the scheduler correctly "interpret" the
meaning of the "nodes" settings. That approach is just flawed. The
scheduler should simply do what the user asks for. And that requires
that the user can specify exactly what resources are required. This is
what currently does not work. And that's why this is a torque problem.

Cheers,
Martin
 
-- 
Martin Siegert
Head, HPC at SFU
WestGrid Site Lead
Academic Computing Services                phone: (604) 291-4691
Simon Fraser University                    fax:   (604) 291-4242
Burnaby, British Columbia                  email: siegert at sfu.ca
Canada  V5A 1S6

