[torquedev] [Bug 78] fix routing of jobs based on nodes spec

bugzilla-daemon at supercluster.org bugzilla-daemon at supercluster.org
Mon Aug 23 18:57:34 MDT 2010


http://www.clusterresources.com/bugzilla/show_bug.cgi?id=78

--- Comment #2 from Martin Siegert <siegert at sfu.ca> 2010-08-23 18:57:34 MDT ---
I am fully aware that this patch is incomplete as far as future plans are
concerned, e.g., bug 67 or combined requests like nodes=...+procs=...
Thus, the patch repairs only the routing code that is in the current
version (and probably almost all previous versions) that is clearly
broken with respect to nodes specifications. As such it is needed - we are
running into problems with "misrouted" jobs right now.

With respect to future plans: I would like to see a "standard" for the
grammar first. E.g., the problem I ran into when writing the patch is
that a nodes specification like

-l nodes=4:ppn=8+n15+procs=15

is actually quite ugly: currently the code uses isdigit to decide whether
the entry after the + sign is a node name (like n15) or specifies a number
of nodes (like 4).
This logic obviously fails when we add something like "+procs=15" to the
mix. Currently "procs" would actually be interpreted as a node name. If we
do want to allow this we need to parse the nodes specification for procs first
and then split the string appropriately. However, the string above is supposed
to be equivalent to

-l nodes=4:ppn=8+n15
-l procs=15

which actually works with the current code. Should this be the only way of
specifying a combined nodes+procs resource?
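
To make the parsing problem concrete, here is a minimal sketch in Python. It is a stand-in for illustration only, not the actual Torque C code. The function names classify_entry and split_procs are hypothetical, and the digit test only mimics the isdigit check described above. The first function shows how "procs=15" falls through to the node-name case; the second shows the "parse for procs first, then split" idea.

```python
def classify_entry(entry):
    """Mimic the digit-based heuristic: an entry whose first character
    is a digit is a node count, anything else is taken as a node name.
    (Illustrative assumption, not the real Torque routine.)"""
    head = entry.split(":", 1)[0]
    return "count" if head[:1].isdigit() else "name"


def split_procs(spec):
    """Pull any procs=N entry out of a '+'-separated nodes spec.

    Returns (remaining nodes spec, procs value or None)."""
    node_entries = []
    procs = None
    for entry in spec.split("+"):
        if entry.startswith("procs="):
            procs = int(entry[len("procs="):])
        else:
            node_entries.append(entry)
    return "+".join(node_entries), procs


spec = "4:ppn=8+n15+procs=15"

# With the digit heuristic alone, "procs=15" is misread as a node name.
print([classify_entry(e) for e in spec.split("+")])

# Extracting procs first leaves a nodes spec the heuristic can handle.
nodes, procs = split_procs(spec)
print(nodes, procs)
```

Under these assumptions the combined string reduces to the same pair of requests as the two separate -l options above, i.e. nodes=4:ppn=8+n15 and procs=15.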

With respect to bug 67: despite the explanations given I am still not clear
what the proposed syntax actually means. E.g.,

-l nodes=1:ppn=2:ncpus=3

Currently, nodes=1:ppn=2 means 1 node, 2 processors per node; ncpus=3 is
basically equivalent to nodes=1:ppn=3 (even though this is not handled by
torque, schedulers treat it that way and users are used to it). Thus, to me
bug 67 changes the meaning of these resource requests. Is that what we want?
And if yes, what does nodes=1:ppn=2:ncpus=3 mean? How does the procs resource
get incorporated into this scheme?

Frankly, I mostly care about nodes=x:ppn=y and procs=z and any combination
of those when it comes to requesting processors. I need that to work. I am
open to other changes, but currently I do not understand the meaning of the
type of requests introduced in bug 67 and therefore have a hard time dealing
with them.

- Martin
