[torqueusers] Problem with ppn and routing : Possible way to get the routing you want continued.

J.A. Magallón jamagallon at ono.com
Thu Dec 2 17:07:19 MST 2010


Hi...

Sorry if I ask some naive questions, but I have not been able to find
the answers, or I have misunderstood some of the things I read. Perhaps you
could include this in a kind of newbie FAQ, if it is not already there...

On Thu, 2 Dec 2010 14:33:01 -0700 (MST), Ken Nielson <knielson at adaptivecomputing.com> wrote:

> The TORQUE "resource manager" knows nothing of ncpus. When a job is submitted and the ncpus keyword is used the string is passed through to the scheduler. In the case of PBS this would be pbs_sched. If you run torque without a scheduler and call qrun for a job you will get a single node and a single processor to run the job.

I was not aware that TORQUE is pretty much useless without a scheduler,
i.e., jobs just sit in the queue in state Q if pbs_sched is not running. I had
understood that without a scheduler there was a built-in FIFO policy, and that
a scheduler just lets you do more complicated things. But running pbs_sched
is mandatory for a real system: it IS the FIFO.
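Just so other newbies don't trip over the same thing, this is roughly what
it looks like with pbs_server alone (the job id and hostname are placeholders):

  # submit a trivial job; with no pbs_sched running it sits in state Q forever
  echo "sleep 60" | qsub -l nodes=1:ppn=1
    -> 123.master
  qstat 123           # shows the job still in state Q

  # as Ken says, the only ways to get it running are to start pbs_sched,
  # or to force it by hand as an operator/manager, in which case it gets
  # a single node and a single processor:
  qrun 123.master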

...
> >> ncpus and nodes are competing ways to specify a resource request. Don't mix
> >> them and everything works better.
> >>
> >> ncpus pre-dates clusters and is used to specify the number of cpus on 1 node.
> >> nodes was grafted into OpenPBS later in life to deal with clusters.
> >>
> >>
> > The TORQUE resource manager has no concept of ncpus. It is interpreted
> > by the scheduler. In the case of Moab it indicates the number of procs
> > to be used per task.
> >
> I don't know much about how ncpus are handled, but torque must have
> some kind of concept, as it predates nodes, and I don't think anyone
> went through and stripped out all kinds of code related to ncpus.  I
> do know that your $PBS_NODEFILE only gets one host for requests using
> ncpus unless you are using something like Maui or Moab.

If you read man pbs_resources, 'nodes' is described with the full spec,
including 'ppn', needed for job submission. Then, if you read the docs about
queue limits here

http://www.clusterresources.com/torquedocs21/4.1queueconfig.shtml#resources_max

it looks like resource limits take just the same form, but that is not really
the case; 'ppn' is ignored, for example. And that page mentions 'nodect',
which is not referenced in man pbs_resources. So I'm really confused...

Let's suppose for the following examples that I have a cluster with 16 quad-core boxen.

So if I understand things correctly, in queues I can only limit the number of
nodes, not ppn. I can set up a queue for long jobs allowing a maximum of 4
nodes, for example. That queue will accept jobs like nodes=1:ppn=4 or nodes=4:ppn=4.
It will just be able to run only one 4:4 job, or two 2:4, or two 4:2,
or sixteen 1:1, etc...
But I cannot limit ppn to force jobs to ppn=2 and so be able
to run several long jobs in parallel.
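In other words, what I have in mind is more or less this (and again, I am not
sure resources_max really behaves the way I describe):

  # a queue for long jobs, limited to 4 of the 16 nodes
  qmgr -c "create queue long queue_type=execution"
  qmgr -c "set queue long resources_max.nodect = 4"
  qmgr -c "set queue long enabled = true"
  qmgr -c "set queue long started = true"

  # these requests are accepted...
  qsub -q long -l nodes=1:ppn=4 job.sh
  qsub -q long -l nodes=4:ppn=4 job.sh
  # ...but there seems to be no resources_max knob that would reject ppn=4
  # and force, say, ppn=2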

Thinking about it, perhaps it would be enough to limit the number of 'cores' a
job can use. Suppose I limit it to 4 (out of 16), so the valid jobs would be
things like 1:4, 2:2 or 4:1. Is there a way to do that, i.e. a limit on the
total number of cores used? ncpus is per single node (not a total) and
outdated, procs is the same as nodes, nodect is also the same, so nothing
seems to fit for this.
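What I am after would be something like this (I am only guessing at the
resource name; 'procct' shows up in some TORQUE docs as a total processor
count, but I do not know whether plain pbs_sched honours it):

  # hypothetical: cap the total core count (nodes x ppn) of a job at 4
  qmgr -c "set queue long resources_max.procct = 4"

  # so that these would be allowed...
  qsub -q long -l nodes=1:ppn=4 job.sh
  qsub -q long -l nodes=2:ppn=2 job.sh
  qsub -q long -l nodes=4:ppn=1 job.sh
  # ...and nodes=4:ppn=4 would be rejected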

TIA

-- 
J.A. Magallon <jamagallon()ono!com>     \               Software is like sex:
                                         \         It's better when it's free

