[torquedev] [Bug 86] Implement transparent resource limits

bugzilla-daemon at supercluster.org
Wed Oct 6 16:06:07 MDT 2010


http://www.clusterresources.com/bugzilla/show_bug.cgi?id=86

--- Comment #6 from Simon Toth <SimonT at mail.muni.cz> 2010-10-06 16:06:06 MDT ---
> No, that is the thing that I completely want to avoid: no scheduling decisions
> must be made based on the transparent resource limits (the server/queue
> configuration attribute leaf resource_limits), and job rejection _is_ a
> scheduling decision.  What I need is to say "If that job, _in the process of
> its execution_, exceeds the specified limit, kill it".  It is ulimit on
> steroids, or "MOM-powered per-queue ulimit over the Torque protocol" (tm).

Why would you want to do that? That's super inefficient. You would allow the
job to grow over the limit, only to kill it once that happens?
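
For context, the behaviour being debated is roughly the following watch-and-kill
loop (a minimal sketch only, not Torque's MOM code; the function names, the 5 s
polling interval and the use of the /proc VmSize field are illustrative
assumptions):

    /* Sketch only: poll a job's virtual memory size and kill it once it
     * crosses a configured ceiling. */
    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Current VmSize of the process in kB, or -1 if it cannot be read. */
    static long vmsize_kb(pid_t pid)
    {
        char path[64], line[256];
        long kb = -1;
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/status", (int)pid);
        if ((f = fopen(path, "r")) == NULL)
            return -1;                          /* process already gone */
        while (fgets(line, sizeof(line), f) != NULL)
            if (sscanf(line, "VmSize: %ld kB", &kb) == 1)
                break;
        fclose(f);
        return kb;
    }

    /* Let the job run, but kill it as soon as it exceeds limit_kb. */
    void enforce_vmem_limit(pid_t job, long limit_kb)
    {
        long kb;

        while ((kb = vmsize_kb(job)) >= 0) {    /* loop ends when the job exits */
            if (kb > limit_kb) {
                kill(job, SIGKILL);             /* job is already over the limit */
                break;
            }
            sleep(5);                           /* illustrative polling interval */
        }
    }

The objection above is that by the time such a check fires, the memory has
already been allocated; a hard limit applied before the job starts makes the
allocation fail instead.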

> The real reason why I created that patch is that our Grid cluster was drowning
> in jobs that ate 15-25 GB of virtual memory and, given that we mostly have
> 8-slot machines, the OOM killer was pretty busy on them; so busy that some
> kernel threads weren't woken up for 3-4 minutes.

Well, why don't you limit the amount of memory in the first place?
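
In Torque terms that would normally mean a per-queue ceiling set through qmgr,
for example (the queue name "batch" and the values are only illustrative):

    qmgr -c "set queue batch resources_default.vmem = 4gb"
    qmgr -c "set queue batch resources_max.vmem = 16gb"

These are the resources_default/resources_max attributes referred to below.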

> But when I tried to use resources_max/resources_default, Maui started to
> underfill our slots, because resources_max/resources_default are transformed
> into job requirements and not merely enforced on the MOM side.  So, the
> codename "transparent" was born ;))

Well, that's definitely a Maui configuration problem and has pretty much
nothing to do with Torque. It's not a very good idea to fix a Maui
configuration problem with a patch for Torque :-D

-- 
Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.

