[torquedev] [Bug 78] fix routing of jobs based on nodes spec

bugzilla-daemon at supercluster.org bugzilla-daemon at supercluster.org
Mon Aug 23 23:59:13 MDT 2010


--- Comment #3 from Simon Toth <SimonT at mail.muni.cz> 2010-08-23 23:59:13 MDT ---
> I am fully aware that this patch is incomplete as far as future plans are
> concerned, e.g., bug 67 or combined requests like nodes=...+procs=...
> Thus, the patch repairs only the routing code that is in the current
> version (and probably almost all previous versions) that is clearly
> broken with respect to nodes specifications. As such it is needed - we are
> running into problems with "misrouted" jobs right now.

My comment wasn't about the "incompleteness" of this patch, but rather about
the fact, that the 67 patch already calculates all resources (into the
ResourcesTotal job attribute). Which can be then checked when routing. Much
better then hacking existing functions to "sort of" support nodespec IMHO.

> With respect to future plans: I would like to see a "standard" for the
> grammar first. E.g., the problem I ran into when writing the patch is
> the a nodes specification like
> -l nodes=4:ppn=8+n15+procs=15
> is actually quite ugly: currently the code uses isdigit to decide whether
> the entry after the + sign is a node name (like n15) or specifies a number
> of nodes (like 4).
> This logic obviously fails when we add something like "+procs=15" to the
> mix. Currently "procs" would actually be interpreted as a node name. If we
> do want to allow this we need to parse the nodes specification for procs first
> and then split the string appropriately. However, the string above is supposed
> to be equivalent to
> -l nodes=4:ppn=8+n15
> -l procs=15
> which actually works with the current code. Should this be the only way of
> specifying a combined nodes+procs resource?

Well, yes this is one of the issues that the 67 bug solves.

> With respect to bug 67: despite the explanations given I am still not clear
> what the proposed syntax actually means. E.g.,
> -l nodes=1:ppn=2:ncpus=3

That is a decision to be made. It's set using resource flags in the
resc_def_all. Current meaning is 1 node, 2 processes, 3 cpus per process. You
can easily set it to mean 1 node 2 processes 3 cpus total.

There are two approaches I have seen, when it came to ppn vs. ncpus. One is to
keep the logic of processes (which is odd and hard to grasp), the second one is
to make ppn and ncpus synonymous (which is a bit hard to enforce).

> Currently, nodes=1:ppn=2 means 1 node, 2 processors per node; ncpus=3 is
> basically equivalent to nodes=1:ppn=3 (even though this is not handled by
> torque, schedulers treat it that way and users are used to it). Thus, to me
> bug 67 changes the meaning of these resources requests. Is that what we want?
> And if yes, what does nodes=1:ppn=2:ncpus=3 mean? How does the procs resource
> get incorporated into this scheme?

Torques features are a bit behind in this area. Simply not using ppn and ncpus
together would be the good approach right now. Either set ppn=1 on each node
and set the cpus using ncpus or set ppn=cpus and don't set ncpus on the node.

> Frankly, I mostly care about nodes=x:ppn=y and procs=z and any combination
> of those when it comes to requesting processors. I need that to work. I am
> open to other changes, but currently I do not understand the meaning of the
> type of requests introduced in bug 67 and therefore have a hard time dealing
> with them.

I just realized that nodect and procs are actually not propagated into the
resources total attribute (but are still counted). I will look into this.

The correct result for -l nodes=x:ppn=y+procs=z would be resources total:
nodect=x, procs=x*y+z

Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.

More information about the torquedev mailing list