[torquedev] [Bug 86] Implement transparent resource limits

bugzilla-daemon at supercluster.org
Wed Oct 6 13:03:05 MDT 2010


--- Comment #5 from Eygene Ryabinkin <rea+maui at grid.kiae.ru> 2010-10-06 13:03:05 MDT ---
(In reply to comment #4)
> Well, the server doesn't have any idea what a resource is (right now). You can
> specify resources, but the server is pretty much oblivious to their existence
> with the exception of resource limits on queues and server (which are enforced).

Maybe the plain Torque server isn't aware of resources, but I always use the
Torque/Maui combo, and Maui certainly knows what the resources are and how to
schedule things based on the reported resources.

> This adds all the support around resources that makes sense. Like also checking
> the nodespec for resource requests, multiplying requests that are per-process by
> the correct value (ppn=2:vmem=2G ->4G), etc...

I think Maui already does this (at least, it understands multiplying vmem
by ppn).
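
As an illustration of the per-process multiplication quoted above
(ppn=2:vmem=2G -> 4G), here is a minimal sketch in Python.  The helper names
parse_size and total_vmem are hypothetical and are not taken from Torque or
Maui; binary multiples (1G = 1024^3 bytes) are an assumption.

```python
# Hypothetical helpers for illustration only; not Torque or Maui code.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a size such as '2G', '2gb' or '512M' into bytes.

    Assumes binary multiples; a bare number is taken as bytes.
    """
    text = text.strip().upper()
    if text.endswith("B"):
        text = text[:-1]          # accept '2GB' as well as '2G'
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def total_vmem(nodespec_part):
    """Multiply a per-process vmem request by ppn.

    Expects a fragment such as 'ppn=2:vmem=2G'; ppn defaults to 1.
    """
    fields = dict(item.split("=", 1) for item in nodespec_part.split(":"))
    ppn = int(fields.get("ppn", 1))
    return ppn * parse_size(fields["vmem"])
```

With these assumptions, total_vmem("ppn=2:vmem=2G") gives the byte count
for 4G.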

> From the description I'm guessing that my patch already does what you want but
> instead of killing the jobs when they reach the node, mine already rejects the
> run request (so the job is never run in the first place).

No, that is exactly what I want to avoid: no scheduling decisions may be
based on the transparent resource limits (the server/queue configuration
attribute leaf resource_limits), and rejecting a job _is_ a scheduling
decision.  What I need is to say "If that job exceeds the specified limit
_while it is executing_, kill it".  It is ulimit on steroids, or
"MOM-powered per-queue ulimit over the Torque protocol" (tm).

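The difference between rejecting a run request and killing a running job can
be sketched as follows.  This is only an illustration in Python, not the
patch itself: the function names admit and exceeded_limits, and the
dictionary layout of limits and usage, are assumptions made for the example.

```python
# Illustration only; names and data layout are hypothetical.

def admit(job_request, transparent_limits):
    """Transparent limits never influence admission or scheduling.

    The job is always accepted here, whatever the limits say.
    """
    return True


def exceeded_limits(usage, transparent_limits):
    """Return the names of resources whose current usage is over the limit.

    A MOM-like enforcer would call this while the job runs and kill the
    job if the returned list is non-empty.
    """
    return [name for name, limit in transparent_limits.items()
            if usage.get(name, 0) > limit]
```

For example, with a vmem limit of 16 * 2**30 bytes, a job currently using
20 * 2**30 bytes would be reported as exceeding "vmem", while the same job
would still have been admitted and scheduled normally.
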
The real reason I created the patch is that our Grid cluster was flooded
with jobs that ate 15-25 GB of virtual memory and, given that we mostly
have 8-slot machines, the OOM killer was pretty busy on them; so busy that some
kernel threads weren't woken up for 3-4 minutes.

But when I tried to use resources_max/resources_default, Maui started to
underfill our slots, because resources_max/resources_default are turned into
job requirements rather than only being enforced on the MOM side.  So the
codename "transparent" was born ;))

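The underfilling effect can be shown with a small worked example in Python.
It assumes a scheduler that treats a per-job vmem requirement as a packing
constraint on the node; the function name jobs_per_node and the figures in
the usage note are made up for illustration and are not measurements from
the cluster.

```python
# Illustration only; the packing model and the function name are assumptions.

def jobs_per_node(slots, node_vmem_gb, per_job_vmem_gb=None):
    """Number of jobs a scheduler would place on one node.

    Without a per-job vmem requirement every slot is filled.  With one,
    the node's virtual memory caps how many jobs fit.
    """
    if per_job_vmem_gb is None:
        return slots
    return min(slots, node_vmem_gb // per_job_vmem_gb)
```

For a hypothetical 8-slot node with 16 GB of virtual memory, a default of
4 GB per job leaves half of the slots empty, whereas a limit that is only
enforced on the MOM side leaves all 8 slots available to the scheduler.
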