Bug 86 - Implement transparent resource limits
Status: NEW
Product: TORQUE
Component: pbs_server
Version: 2.5.x
Hardware/OS: Other Linux
Importance: P5 enhancement
Assigned To: Glen
Reported: 2010-10-06 07:37 MDT by Eygene Ryabinkin
Modified: 2010-10-07 14:07 MDT

Attachments
Patch that implements transparent resource limiting (64.37 KB, patch)
2010-10-06 07:38 MDT, Eygene Ryabinkin




Description Eygene Ryabinkin 2010-10-06 07:37:25 MDT
The comments in the attached patch speak for themselves, but in short, I wanted
to implement server/queue resource limits that are enforced on the MOM side,
but aren't passed to the scheduler and, thus, don't affect the scheduling
process.

My real-world example is the vmem limitation: I just want to kill outrageous
jobs and don't want our Maui to consider virtual memory requirements in the
scheduling process, as it would if I used resources_default/resources_max.

This patch is currently being tested on our production cluster running Torque
2.5.2.  No problems have been spotted to date.
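
For illustration, a minimal sketch of the intended administrator workflow,
assuming the resource_limits attribute name used by the patch (see comment 5):

  # With the patch applied, cap vmem on a queue without creating a
  # scheduling constraint; the value is only shipped to the MOM for
  # enforcement and is never reported to the scheduler:
  qmgr -c "set queue batch resource_limits.vmem = 4gb"

  # For comparison, the stock attribute, which becomes part of the job's
  # requirements and therefore influences scheduling:
  qmgr -c "set queue batch resources_max.vmem = 4gb"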
Comment 1 Eygene Ryabinkin 2010-10-06 07:38:20 MDT
Created an attachment (id=54)
Patch that implements transparent resource limiting
Comment 2 Simon Toth 2010-10-06 11:04:30 MDT
Is it compatible with generic resources support?

http://www.clusterresources.com/bugzilla/show_bug.cgi?id=67

I'm asking instead of going through the patch, because the patch is rather
long.
Comment 3 Eygene Ryabinkin 2010-10-06 11:43:04 MDT
(In reply to comment #2)
> Is it compatible with generic resources support?
> 
> http://www.clusterresources.com/bugzilla/show_bug.cgi?id=67
> 
> I'm asking instead of going through the patch, because the patch is rather
> long.

Your patches in #67 are long too, so I just went through the initial
explanations in that ticket.  It seems to me that the patches are completely
orthogonal: mine just transfers server or per-queue resource limits to the
MOM as an additional job attribute, and then the MOM checks whether the job
exceeds these limits.

Your patch seems to add the ability to specify node properties (read:
resources).  Mine doesn't care about that; it just enforces the limits and
kills the jobs.

Feel free to correct me, because I don't understand what your patch does, even
having read all comments in #67 :(
Comment 4 Simon Toth 2010-10-06 11:55:48 MDT
(In reply to comment #3)
> (In reply to comment #2)
> > Is it compatible with generic resources support?
> > 
> > http://www.clusterresources.com/bugzilla/show_bug.cgi?id=67
> > 
> > I'm asking instead of going through the patch, because the patch is rather
> > long.
> 
> Your patches in #67 are long too, so I just went through the initial
> explanations in that ticket.  It seems to me that the patches are completely
> orthogonal: mine just transfers server or per-queue resource limits to the
> MOM as an additional job attribute, and then the MOM checks whether the job
> exceeds these limits.
> 
> Your patch seems to add the ability to specify node properties (read:
> resources).  Mine doesn't care about that; it just enforces the limits and
> kills the jobs.
> 
> Feel free to correct me, because I don't understand what your patch does, even
> having read all comments in #67 :(

Well, the server doesn't have any idea what a resource is (right now). You can
specify resources, but the server is pretty much oblivious to their existence
with the exception of resource limits on queues and the server (which are enforced).

My patch adds all the support around resources that makes sense: checking the
nodespec for resource requests, multiplying requests that are per-process by
the correct value (ppn=2:vmem=2G -> 4G), etc.
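
For concreteness, a sketch of the multiplication, using the
resources-in-nodespec syntax that the #67 patch adds (not stock Torque
syntax):

  # vmem here is a per-process request, so the server accounts for
  # ppn * vmem on the node: 2 processes * 2G = 4G in total.
  qsub -l nodes=1:ppn=2:vmem=2G job.sh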

From the description I'm guessing that my patch already does what you want,
but instead of killing jobs once they reach the node, it rejects the run
request (so the job is never run in the first place).
Comment 5 Eygene Ryabinkin 2010-10-06 13:03:05 MDT
(In reply to comment #4)
> Well, the server doesn't have any idea what a resource is (right now). You can
> specify resources, but the server is pretty much oblivious to their existence
> with the exception of resource limits on queues and the server (which are enforced).

Maybe the plain Torque server isn't aware of resources, but I am always using
the Torque/Maui combo, and Maui certainly knows what the resources are and how
to schedule things based on the reported resources.

> My patch adds all the support around resources that makes sense: checking the
> nodespec for resource requests, multiplying requests that are per-process by
> the correct value (ppn=2:vmem=2G -> 4G), etc.

I think that Maui does it (at least, it understands the multiplication of ppn
by vmem).

> From the description I'm guessing that my patch already does what you want,
> but instead of killing jobs once they reach the node, it rejects the run
> request (so the job is never run in the first place).

No, that is the thing that I completely want to avoid: no scheduling decisions
must be made based on the transparent resource limits (the server/queue
configuration attribute leaf resource_limits), and a job reject _is_ a
scheduling decision.  What I need is to say "If that job _in the process of its
execution_ exceeds the specified limit, kill it".  It is ulimit on steroids, or
"MOM-powered per-queue ulimit over the Torque protocol" (tm).


The real reason why I created this patch is that our Grid cluster was drowned
in jobs that ate 15-25 GB of virtual memory and, given that we mostly have
8-slot machines, the OOM killer was pretty busy on them; so busy that some
kernel threads weren't woken up for 3-4 minutes.

But when I tried to use resources_max/resources_default, Maui started to
underfill our slots, because resources_max/resources_default are transformed
into job requirements, not just enforced on the MOM side.  So the codename
"transparent" was born ;))
Comment 6 Simon Toth 2010-10-06 16:06:06 MDT
> No, that is the thing that I completely want to avoid: no scheduling decisions
> must be made based on the transparent resource limits (the server/queue
> configuration attribute leaf resource_limits), and a job reject _is_ a
> scheduling decision.  What I need is to say "If that job _in the process of its
> execution_ exceeds the specified limit, kill it".  It is ulimit on steroids, or
> "MOM-powered per-queue ulimit over the Torque protocol" (tm).

Why would you want to do that? That's super ineffective. You will allow the
job to grow over the limit, and then kill it when that happens?

> The real reason why I created this patch is that our Grid cluster was drowned
> in jobs that ate 15-25 GB of virtual memory and, given that we mostly have
> 8-slot machines, the OOM killer was pretty busy on them; so busy that some
> kernel threads weren't woken up for 3-4 minutes.

Well, why don't you limit the amount of memory in the first place?

> But when I tried to use resources_max/resources_default, Maui started to
> underfill our slots, because resources_max/resources_default are transformed
> into job requirements, not just enforced on the MOM side.  So the codename
> "transparent" was born ;))

Well, that's definitely a Maui configuration problem and has pretty much
nothing to do with Torque. It's not a very good idea to fix a Maui
configuration problem with a patch for Torque :-D
Comment 7 Eygene Ryabinkin 2010-10-06 23:58:15 MDT
(In reply to comment #6)
> > No, that is the thing that I completely want to avoid: no scheduling decisions
> > must be made based on the transparent resource limits (the server/queue
> > configuration attribute leaf resource_limits), and a job reject _is_ a
> > scheduling decision.  What I need is to say "If that job _in the process of its
> > execution_ exceeds the specified limit, kill it".  It is ulimit on steroids, or
> > "MOM-powered per-queue ulimit over the Torque protocol" (tm).
> 
> Why would you want to do that?

Why would I want to do what?  Have you ever tuned ulimits on machines, say,
via /etc/security/limits.conf?  I just need to enforce resource limits -- that
is all I want.

> That's super ineffective.

Please, explain your point.

> You will allow the job to grow over the limit, and then kill it when that happens?

Yes, and that is called limit enforcement.  By the way, that is how law
enforcement works: before someone is arrested, they must have violated
something, not the other way round (in a perfect world, of course ;))

Once again: Grid jobs come without any clues (for Torque) about their memory
requirements.
So, I know by the SLA and some empirical knowledge that I should give no more
than 4 GB of virtual memory; so I will enforce this limit: any job that takes
more vmem will be killed.

You might say that this is ineffective, that such jobs shouldn't be allowed to
run at all, but I can't predict at submission time that a job will go over the
limit: crime first, punishment second.

> > The real reason why I created this patch is that our Grid cluster was drowned
> > in jobs that ate 15-25 GB of virtual memory and, given that we mostly have
> > 8-slot machines, the OOM killer was pretty busy on them; so busy that some
> > kernel threads weren't woken up for 3-4 minutes.
> 
> Well, why don't you limit the amount of memory in the first place?

Via what means?

> > But when I tried to use resources_max/resources_default, Maui started to
> > underfill our slots, because resources_max/resources_default are transformed
> > into job requirements, not just enforced on the MOM side.  So the codename
> > "transparent" was born ;))
> 
> Well, that's definitely a Maui configuration problem and has pretty much
> nothing to do with Torque.

I am sorry, but you're plain wrong.
Maui does what it should do: it evaluates job requirements and selects slots
based on them.
The problem is that I just don't want _administrator-set_ resource limits to be
treated as job requirements.
If a job additionally specifies requirements -- fine, the scheduler should obey
the requests (as long as they aren't higher than resources_max).

> It's not a very good idea to fix a Maui configuration problem with a patch for Torque :-D

I am sorry for being a bit harsh, but given that you haven't seen my Torque and
Maui configuration, you can't judge whether there are configuration problems in
it.

Once again: if the job itself specifies the limit -- fine, Maui should respect
it and choose a slot that fulfills the requirement.  But _not every job_ will
want, say, 4 GB of vmem; that's the problem.  Some of them will want only 1 GB,
so forcing the scheduler to find slots with 4 GB of free vmem for such jobs --
that's ineffective, because I want all our job slots to be populated with
tasks.  I know that the average memory consumption per job is 2 GB, so I am
setting the 4 GB cap to filter out outrageous jobs.

If you have an idea how to do this with the Torque/Maui combo without my patch,
while still letting jobs specify their own resource requirements -- I am all
ears.
Comment 8 Simon Toth 2010-10-07 01:51:27 MDT
(In reply to comment #7)
> (In reply to comment #6)
> > > No, that is the thing that I completely want to avoid: no scheduling decisions
> > > must be made based on the transparent resource limits (the server/queue
> > > configuration attribute leaf resource_limits), and a job reject _is_ a
> > > scheduling decision.  What I need is to say "If that job _in the process of its
> > > execution_ exceeds the specified limit, kill it".  It is ulimit on steroids, or
> > > "MOM-powered per-queue ulimit over the Torque protocol" (tm).
> > 
> > Why would you want to do that?
> 
> Why would I want to do what?  Have you ever tuned ulimits on machines, say,
> via /etc/security/limits.conf?  I just need to enforce resource limits -- that
> is all I want.

There are tons of possible approaches. Torque supports only ulimit as far as I
know.

> > That's super ineffective.
> 
> Please, explain your point.
> 
> > You will allow the job to grow over the limit, and then kill it when that happens?
> 
> Yes, and that is called limit enforcement.  By the way, that is how law
> enforcement works: before someone is arrested, they must have violated
> something, not the other way round (in a perfect world, of course ;))

No, it definitely doesn't work this way. What you are doing is drawing an
invisible line (in a place where you should build a fence) and shooting
everyone who crosses the line.

> Once again: Grid jobs come without any clues (for Torque) about their memory
> requirements.

One of the problems, but OK, this one might not be solvable.

> So, I know by the SLA and some empirical knowledge that I should give no more
> than 4 GB of virtual memory; so I will enforce this limit: any job that takes
> more vmem will be killed.
>
> You might say that this is ineffective, that such jobs shouldn't be allowed to
> run at all, but I can't predict at submission time that a job will go over the
> limit: crime first, punishment second.

First, jobs should declare that; second, I'm not talking about submission, I'm
talking about limiting (not killing during runtime).

> > > The real reason why I created this patch is that our Grid cluster was drowned
> > > in jobs that ate 15-25 GB of virtual memory and, given that we mostly have
> > > 8-slot machines, the OOM killer was pretty busy on them; so busy that some
> > > kernel threads weren't woken up for 3-4 minutes.
> > 
> > Well, why don't you limit the amount of memory in the first place?
> 
> Via what means?

Cgroups, ulimit, virtual machines... Millions of choices.
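
For example (hedged sketches; exact paths and availability depend on the
system, and both mechanisms must be administered on every execution host):

  # Per-process virtual memory cap via ulimit, in 1024-byte blocks (4 GB):
  ulimit -v 4194304

  # cgroup v1 memory controller (limits resident/charged memory rather than
  # vmem in the ulimit sense):
  mkdir /sys/fs/cgroup/memory/torque_jobs
  echo $((4*1024*1024*1024)) > /sys/fs/cgroup/memory/torque_jobs/memory.limit_in_bytes
  echo $$ > /sys/fs/cgroup/memory/torque_jobs/tasks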

> > > But when I tried to use resources_max/resources_default, Maui started to
> > > underfill our slots, because resources_max/resources_default are transformed
> > > into job requirements, not just enforced on the MOM side.  So the codename
> > > "transparent" was born ;))
> > 
> > Well, that's definitely a Maui configuration problem and has pretty much
> > nothing to do with Torque.
> 
> I am sorry, but you're plain wrong.
> Maui does what it should do: it evaluates job requirements and selects slots
> based on them.
> The problem is that I just don't want _administrator-set_ resource limits to be
> treated as job requirements.

And that's a configuration issue in Maui and not Torque.

> If a job additionally specifies requirements -- fine, the scheduler should obey
> the requests (as long as they aren't higher than resources_max).

Again, nothing to do with Torque, pure Maui configuration.

> > It's not a very good idea to fix a Maui configuration problem with a patch for Torque :-D
> 
> I am sorry for being a bit harsh, but given that you haven't seen my Torque and
> Maui configuration, you can't judge whether there are configuration problems in
> it.
>
> Once again: if the job itself specifies the limit -- fine, Maui should respect
> it and choose a slot that fulfills the requirement.  But _not every job_ will
> want, say, 4 GB of vmem; that's the problem.  Some of them will want only 1 GB,
> so forcing the scheduler to find slots with 4 GB of free vmem for such jobs --
> that's ineffective, because I want all our job slots to be populated with
> tasks.  I know that the average memory consumption per job is 2 GB, so I am
> setting the 4 GB cap to filter out outrageous jobs.

Why would any scheduler allocate 4 GB for a job that requests 1 GB?

> If you have an idea how to do this with the Torque/Maui combo without my patch,
> while still letting jobs specify their own resource requirements -- I am all
> ears.

You should ask that on the Maui mailing list, not here.
Comment 9 Eygene Ryabinkin 2010-10-07 03:26:00 MDT
Simon, let us stop this conversation: you don't understand what I want and I am
not going to explain it once again (what you're missing is that our jobs aren't
requesting any resources from Torque, because they just don't know how much
memory they are going to use).

I can use ulimit and cgroups, but it is simpler to use central configuration in
Torque than to configure hundreds of nodes.
Comment 10 Simon Toth 2010-10-07 08:25:59 MDT
> Simon, let us stop this conversation: you don't understand what I want and I am
> not going to explain it once again (what you're missing is that our jobs aren't
> requesting any resources from Torque, because they just don't know how much
> memory they are going to use).

Yes, I understand that your jobs do not have resource requests on them. You can
even read my reaction to this in the previous comment: "One of the
problems, but OK, this one might not be solvable."

What you seem to totally ignore is the correct way to approach this. You can
configure the server to add resource limitations to jobs (which will then be
enforced using ulimit), and YOU need to configure YOUR scheduler to interpret
this correctly. In your case that means "ignore the vmem resource".

Once again, this is a scheduler configuration problem. Your scheduler is not
ignoring what you want it to ignore, but instead of configuring the scheduler
you wrote a patch for Torque.

> I can use ulimit and cgroups, but it is simpler to use central configuration in
> Torque than to configure hundreds of nodes.

I am talking about central configuration on the server!

Ulimit is already supported in Torque. If you implemented cgroup support in
Torque instead of this patch, it would be awesome, because cgroups give you
much more control over limits.
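
A minimal sketch of what I mean by central configuration (the Maui side --
teaching the scheduler to ignore the inherited value when placing jobs -- is
Maui configuration and not shown here):

  # Attach a default per-process limit to every job centrally; pvmem is
  # enforced by the MOM via setrlimit, i.e. the ulimit path:
  qmgr -c "set server resources_default.pvmem = 4gb"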
Comment 11 Simon Toth 2010-10-07 08:33:05 MDT
(In reply to comment #9)
> Simon, let us stop this conversation: you don't understand what I want and I am
> not going to explain it once again (what you're missing is that our jobs aren't
> requesting any resources from Torque, because they just don't know how much
> memory they are going to use).
> 
> I can use ulimit and cgroups, but it is simpler to use central configuration in
> Torque than to configure hundreds of nodes.

As a side note: a patch that would add a server flag or node configuration
option modifying the logic from "limit using ulimit" to "kill the job" would be
totally OK (although I think this approach is just weird). But it would have to
work with the normal resources.
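
A hypothetical sketch of such an option in the MOM configuration ($-style
directives are the existing pbs_mom convention, but this particular directive
name is invented here for illustration and does not exist in stock Torque):

  # /var/spool/torque/mom_priv/config
  $limit_enforcement kill    # hypothetical: kill over-limit jobs instead of
                             # applying setrlimit-based limits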
Comment 12 Eygene Ryabinkin 2010-10-07 10:38:19 MDT
Once again, I am asking you to stop spamming -- you don't understand the point
of the patch.
If you have something to explain to me -- do it privately; don't make others
read this stupid conversation in which both you and I are just repeating our
statements.

You seem to have a formed opinion on how to schedule jobs and how to allocate
resources.  What you can't understand is that not all use cases can be
described by your view of things -- there are many different situations and
many different scheduling and resource usage policies, especially in mixed
environments where multi-organizational jobs are processed.

Besides, you don't seem to be familiar with the patch you're trying to
criticize.  So, what's the point?  You think that what I am trying to implement
is doable without this patch?  OK, prove it; don't just say "You should ask
that on the Maui mailing list, not here".  Or, if you can't prove it, don't
throw out assertions like "It's not a very good idea to fix a Maui
configuration problem with a patch for Torque", especially since you say
yourself that you're not an expert in Maui/Moab:
http://www.clusterresources.com/bugzilla/show_bug.cgi?id=67#c22

(In reply to comment #10)
> What you seem to totally ignore is the correct way to approach this. You can
> configure the server to add resource limitations to jobs (which will then be
> enforced using ulimit), and YOU need to configure YOUR scheduler to interpret
> this correctly. In your case that means "ignore the vmem resource".

The scheduler shouldn't ignore the 'vmem' resource if the user asked for it --
that would be plain wrong.  And your "correct" approach won't solve this
problem: I need the scheduler to consider 'vmem' and other attributes when they
were explicitly requested, but I also need to limit total 'vmem' consumption in
any case.  That isn't doable with resources_max/resources_default, because
there is no way to tell who set the requirement -- the user or
'resources_default': to the scheduler it is just a requirement.  And I can't
turn it off, as explained at the beginning of this paragraph.

> Once again, this is a scheduler configuration problem. Your scheduler is not
> ignoring what you want it to ignore, but instead of configuring the scheduler
> you wrote a patch for Torque.

Once again, you just don't understand the nature of the patch.

> I am talking about central configuration on the server!
> 
> Ulimit is already supported in Torque.

You seem to be unaware of the difference between 'vmem' and 'pvmem'.
There is no sane way to limit 'vmem' using ulimit.  Please go and read the
sources -- mom_set_limits(), mom_over_limit() and mom_do_poll() in the
arch-dependent mom_mach.c should be very enlightening.  Or, at least, glance
over http://www.clusterresources.com/torquedocs/2.1jobsubmission.shtml

And this "ulimit implementation" has side effects on scheduling, as was
explained a number of times.
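
To spell out the difference (a sketch; the enforcement paths are the functions
named above):

  # pvmem: per-process cap; the MOM applies it with setrlimit, i.e. the
  # ulimit path in mom_set_limits():
  qsub -l pvmem=4gb job.sh

  # vmem: aggregate cap over all processes of the job; no single setrlimit
  # can express it, so the MOM polls usage (mom_do_poll()/mom_over_limit())
  # and kills the job once the sum exceeds the limit:
  qsub -l vmem=4gb job.sh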

> If you implemented cgroup support in Torque instead of this patch, it would be awesome, because cgroups give you much more control over limits.

Please stop telling me what I should do, especially in such a way.  Maybe I
will implement cgroup support if I need it, but I certainly won't do it because
someone tells me "Hey, you!  Instead of doing your patches, implement this
brilliant idea".  Communication is a skill, and you seem to be lacking in that
area.
Comment 13 Simon Toth 2010-10-07 14:07:29 MDT
Oh, yes, personal attacks; I haven't seen those in some time. :-)

> The scheduler shouldn't ignore the 'vmem' resource if the user asked for it --
> that would be plain wrong.  And your "correct" approach won't solve this
> problem: I need the scheduler to consider 'vmem' and other attributes when they
> were explicitly requested, but I also need to limit total 'vmem' consumption in
> any case.  That isn't doable with resources_max/resources_default, because
> there is no way to tell who set the requirement -- the user or
> 'resources_default': to the scheduler it is just a requirement.  And I can't
> turn it off, as explained at the beginning of this paragraph.

Then you want to configure the scheduler to check just the max limit and not
calculate the used amount. It's so damn simple.

OK, I will end this here as requested. I have already expressed all my opinions
anyway.