Bugzilla – Bug 86
Implement transparent resource limits
Last modified: 2010-10-07 14:07:29 MDT
The comments in the attached patch speak for themselves, but in short, I wanted to implement server/queue resource limits that are enforced on the MOM side but aren't passed to the scheduler and, thus, don't affect the scheduling process. My real-world example is the vmem limit: I just want to kill outrageous jobs and don't want our Maui to consider virtual memory requirements in the scheduling process, as it would if I used resources_default/resources_max. This patch is currently being tested on our production cluster running Torque 2.5.2. No problems have been spotted to date.
Created an attachment (id=54): Patch that implements transparent resource limiting
Is it compatible with generic resources support? http://www.clusterresources.com/bugzilla/show_bug.cgi?id=67 I'm asking instead of going through the patch, because the patch is rather long.
(In reply to comment #2)
> Is it compatible with generic resources support?
>
> http://www.clusterresources.com/bugzilla/show_bug.cgi?id=67
>
> I'm asking instead of going through the patch, because the patch is rather
> long.

Your patches in #67 are long too, so I have just gone through the initial explanations in this ticket. It seems to me that the patches are completely orthogonal, because what mine does is just transfer server or per-queue resource limits to the MOM as an additional job attribute, and then the MOM checks that these resource limits aren't exceeded by the job.

Your patch seems to add the ability to specify the node properties (read, resources). Mine doesn't care about it, it just enforces the limits and kills the jobs.

Feel free to correct me, because I don't understand what your patch does, even having read all comments in #67 :(
(In reply to comment #3)
> (In reply to comment #2)
> > Is it compatible with generic resources support?
> >
> > http://www.clusterresources.com/bugzilla/show_bug.cgi?id=67
> >
> > I'm asking instead of going through the patch, because the patch is rather
> > long.
>
> Your patches in #67 are long too, so I have just gone through the initial
> explanations in this ticket. It seems to me that the patches are completely
> orthogonal, because what mine does is just transfer server or per-queue
> resource limits to the MOM as an additional job attribute, and then the MOM
> checks that these resource limits aren't exceeded by the job.
>
> Your patch seems to add the ability to specify the node properties (read,
> resources). Mine doesn't care about it, it just enforces the limits and kills
> the jobs.
>
> Feel free to correct me, because I don't understand what your patch does, even
> having read all comments in #67 :(

Well, the server doesn't have any idea what a resource is (right now). You can specify resources, but the server is pretty much oblivious to their existence with the exception of resource limits on queues and server (which are enforced).

This adds all the support around resources that makes sense. Like also checking the nodespec for resource requests, multiplying requests that are per-process by the correct value (ppn=2:vmem=2G -> 4G), etc...

From the description I'm guessing that my patch already does what you want but instead of killing the jobs when they reach the node, mine already rejects the run request (so the job is never run in the first place).
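For readers following the "ppn=2:vmem=2G -> 4G" remark, the multiplication can be sketched roughly as below. This is only an illustration of the arithmetic, not code from either patch: the function names, the `key=value` parsing of the shorthand spec, and the choice of which resources count as per-process are all assumptions made for the example.

```python
# Illustrative only: helper names and the "key=value:key=value" spec format
# are assumptions taken from the shorthand in the comment, not Torque code.
UNIT_BYTES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text: str) -> int:
    """Parse a size such as '2G', '2gb' or '512M' into bytes."""
    text = text.strip().upper().rstrip("B")
    if text and text[-1] in UNIT_BYTES:
        return int(text[:-1]) * UNIT_BYTES[text[-1]]
    return int(text)  # a bare number is taken as bytes


def job_wide_request(spec: str, per_process=("vmem",)) -> dict:
    """Scale per-process resource requests by ppn to get job-wide totals."""
    fields = dict(item.split("=", 1) for item in spec.split(":"))
    ppn = int(fields.pop("ppn", "1"))
    totals = {}
    for name, value in fields.items():
        size = parse_size(value)
        # Only resources treated as per-process are multiplied by ppn.
        totals[name] = size * ppn if name in per_process else size
    return totals


if __name__ == "__main__":
    print(job_wide_request("ppn=2:vmem=2G"))
```

With the spec from the comment, two processes at 2G each give a job-wide figure of 4G.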
(In reply to comment #4)
> Well, the server doesn't have any idea what a resource is (right now). You can
> specify resources, but the server is pretty much oblivious to their existence
> with the exception of resource limits on queues and server (which are enforced).

Maybe the plain Torque server isn't aware of resources, but I am always using the Torque/Maui combo, and Maui certainly knows what the resources are and how to schedule things based on the reported resources.

> This adds all the support around resources that makes sense. Like also checking
> the nodespec for resource requests, multiplying requests that are per-process by
> the correct value (ppn=2:vmem=2G -> 4G), etc...

I think that Maui does it (at least, it understands the multiplication of ppn by vmem).

> From the description I'm guessing that my patch already does what you want but
> instead of killing the jobs when they reach the node, mine already rejects the
> run request (so the job is never run in the first place).

No, that is the thing that I completely want to avoid: no scheduling decisions must be made based on the transparent resource limits (server/queue configuration attribute leaf resource_limits), and a job reject _is_ a scheduling decision. What I need is to say "If that job _in the process of its execution_ exceeds the specified limit, kill it". It is ulimit on steroids or "MOM-powered per-queue ulimit over the Torque protocol" (tm).

The real reason why I created that patch is that our Grid cluster was drowned with jobs that ate 15-25 GB of virtual memory and, given that we mostly have 8-slot machines, the OOM killer was pretty busy on them; so busy that some kernel threads weren't woken up for 3-4 minutes.

But when I tried to use resources_max/resources_default, Maui started to underfill our slots, because resources_max/resources_default are transformed into job requirements and not only enforced on the MOM side. So, the codename "transparent" was born ;))
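The distinction drawn above, between limits that become job requirements and "transparent" limits that are only enforced at run time, can be modelled in a few lines. This is a toy model under stated assumptions, not the patch itself: the `Queue` and `Job` classes and all function names are hypothetical, and only the attribute names resources_default and resource_limits come from the thread.

```python
# Toy model only: classes and function names are hypothetical; it mirrors the
# behaviour described in the comment, not the actual Torque/MOM code.
from dataclasses import dataclass, field

GB = 1024**3


@dataclass
class Queue:
    resources_default: dict = field(default_factory=dict)  # become job requirements
    resource_limits: dict = field(default_factory=dict)    # transparent, run-time only


@dataclass
class Job:
    requested: dict = field(default_factory=dict)  # what the user explicitly asked for


def scheduler_requirements(job: Job, queue: Queue) -> dict:
    """What the scheduler sees: defaults plus explicit requests, never resource_limits."""
    return {**queue.resources_default, **job.requested}


def exceeded_limits(usage: dict, limits: dict) -> list:
    """Names of resources whose measured usage is above the transparent limit."""
    return sorted(name for name, cap in limits.items() if usage.get(name, 0) > cap)


def should_kill(usage: dict, queue: Queue) -> bool:
    """Run-time decision: kill only once a limit has actually been crossed."""
    return bool(exceeded_limits(usage, queue.resource_limits))


if __name__ == "__main__":
    grid = Queue(resource_limits={"vmem": 4 * GB})
    job = Job()  # a Grid job that requests nothing
    print(scheduler_requirements(job, grid))   # nothing for the scheduler to weigh
    print(should_kill({"vmem": 20 * GB}, grid))
```

In this model a job that requests nothing presents no vmem requirement to the scheduler, yet is still killed once its measured usage crosses the 4 GB cap.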
> No, that is the thing that I completely want to avoid: no scheduling decisions
> must be made based on the transparent resource limits (server/queue
> configuration attribute leaf resource_limits), and a job reject _is_ a
> scheduling decision. What I need is to say "If that job _in the process of its
> execution_ exceeds the specified limit, kill it". It is ulimit on steroids or
> "MOM-powered per-queue ulimit over the Torque protocol" (tm).

Why would you want to do that? That's super ineffective. You will allow the job to grow over the limit, but kill it when it happens?

> The real reason why I created that patch is that our Grid cluster was drowned
> with jobs that ate 15-25 GB of virtual memory and, given that we mostly
> have 8-slot machines, the OOM killer was pretty busy on them; so busy that some
> kernel threads weren't woken up for 3-4 minutes.

Well, why don't you limit the amount of memory in the first place?

> But when I tried to use resources_max/resources_default, Maui started to
> underfill our slots, because resources_max/resources_default are transformed into
> job requirements and not only enforced on the MOM side. So, the codename
> "transparent" was born ;))

Well, that's definitely a Maui configuration problem and has pretty much nothing to do with Torque.

Not a very good idea to fix a Maui configuration problem with a patch for Torque :-D
(In reply to comment #6)
> > No, that is the thing that I completely want to avoid: no scheduling decisions
> > must be made based on the transparent resource limits (server/queue
> > configuration attribute leaf resource_limits), and a job reject _is_ a
> > scheduling decision. What I need is to say "If that job _in the process of its
> > execution_ exceeds the specified limit, kill it". It is ulimit on steroids or
> > "MOM-powered per-queue ulimit over the Torque protocol" (tm).
>
> Why would you want to do that?

Why would I want to do what? Have you ever tuned ulimits on the machines, say, via /etc/security/limits.conf? I just need to enforce resource limits -- that's all I want.

> That's super ineffective.

Please, explain your point.

> You will allow the job to grow over the limit, but kill it when it happens?

Yes, and that is called limit enforcement. By the way, that is how law enforcement works: prior to arresting someone, he should violate something, not the other way round (in a perfect world, of course ;))

Once again: Grid jobs are coming without any clues (for Torque) on their memory requirements.

So, I know that by the SLA and some empirical knowledge, I should give no more than 4gb of virtual memory; so I will enforce this limit: any job that takes more vmem will be killed.

You might say that this is ineffective, that jobs shouldn't be allowed to execute at all, but I can't predict that the job will go over the limit at the time of its submission: crime first, punishment second.

> > The real reason why I created that patch is that our Grid cluster was drowned
> > with jobs that ate 15-25 GB of virtual memory and, given that we mostly
> > have 8-slot machines, the OOM killer was pretty busy on them; so busy that some
> > kernel threads weren't woken up for 3-4 minutes.
>
> Well, why don't you limit the amount of memory in the first place?

Via what means?
> > But when I tried to use resources_max/resources_default, Maui started to
> > underfill our slots, because resources_max/resources_default are transformed into
> > job requirements and not only enforced on the MOM side. So, the codename
> > "transparent" was born ;))
>
> Well, that's definitely a Maui configuration problem and has pretty much
> nothing to do with Torque.

I am sorry, but you're plain wrong.
Maui does what it should do: it evaluates job requirements and selects slots based on them.
The problem is that I just don't want _administrator-set_ resource limits to be treated as the job requirements.
If a job additionally specifies requirements -- let it be, the scheduler should obey the requests (if they aren't higher than resources_max).

> Not a very good idea to fix a Maui configuration problem with a patch for Torque :-D

I am sorry for being a bit harsh, but given that you haven't seen my Torque and Maui configuration, you can't judge if there are some configuration problems in it.

Once again: if the job itself specifies the limit -- let it be, Maui should respect it and choose the slot that fulfills the requirement. But _not every job_ will want, say, 4gb of vmem, that's the problem. Some of them will want only 1gb, so forcing the scheduler to find slots with 4gb of free vmem for such a job -- that's ineffective, because I want all our job slots to be populated with tasks. I know that the average memory consumption for a job is 2gb, so I am setting the 4gb cap to filter outrageous jobs.

If you have an idea how to do it with the Torque/Maui combo without using my patch and fulfilling the requirement of a job being able to specify its own resource requirements -- I am all ears.
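The underfilling argument is easier to see with numbers. The sketch below is a deliberately simplified placement model, not Maui's actual algorithm; the function name is hypothetical, and the 16 GB of schedulable vmem per node is an assumed figure chosen only to make the arithmetic visible. The 8 slots, the 4gb cap and the 2gb average come from the thread.

```python
# Simplified model: a scheduler that reserves the full per-job requirement on
# a node. Node size (16 GB) is an assumption for illustration only.
def jobs_placed(slots: int, node_vmem_gb: int, per_job_requirement_gb: int) -> int:
    """How many jobs fit on one node if each reserves its stated requirement."""
    if per_job_requirement_gb <= 0:
        return slots  # no vmem requirement: only the slot count limits placement
    return min(slots, node_vmem_gb // per_job_requirement_gb)


if __name__ == "__main__":
    slots, node_vmem_gb = 8, 16
    # A 4gb cap injected as a default requirement: each job reserves 4 GB.
    print(jobs_placed(slots, node_vmem_gb, 4))
    # A transparent 4gb cap: no requirement reaches the scheduler.
    print(jobs_placed(slots, node_vmem_gb, 0))
```

Under these assumptions the requirement-style limit fills only 4 of the 8 slots, while the transparent limit lets all 8 be used; at the stated 2gb average, eight such jobs would total 16 GB.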
(In reply to comment #7)
> (In reply to comment #6)
> > > No, that is the thing that I completely want to avoid: no scheduling decisions
> > > must be made based on the transparent resource limits (server/queue
> > > configuration attribute leaf resource_limits), and a job reject _is_ a
> > > scheduling decision. What I need is to say "If that job _in the process of its
> > > execution_ exceeds the specified limit, kill it". It is ulimit on steroids or
> > > "MOM-powered per-queue ulimit over the Torque protocol" (tm).
> >
> > Why would you want to do that?
>
> Why would I want to do what? Have you ever tuned ulimits on the machines, say,
> via /etc/security/limits.conf? I just need to enforce resource limits -- that's
> all I want.

There are tons of possible approaches. Torque supports only ulimit as far as I know.

> > That's super ineffective.
>
> Please, explain your point.
>
> > You will allow the job to grow over the limit, but kill it when it happens?
>
> Yes, and that is called limit enforcement. By the way, that is how law
> enforcement works: prior to arresting someone, he should violate something, not
> the other way round (in a perfect world, of course ;))

No, it definitely doesn't work this way. What you are doing is drawing an invisible line (in a place where you should build a fence) and shooting everyone that crosses the line.

> Once again: Grid jobs are coming without any clues (for Torque) on their
> memory requirements.

One of the problems, but OK, this one might not be solvable.

> So, I know that by the SLA and some empirical knowledge, I should give no more
> than 4gb of virtual memory; so I will enforce this limit: any job that takes
> more vmem will be killed.
>
> You might say that this is ineffective, that jobs shouldn't be allowed to
> execute at all, but I can't predict that the job will go over the limit at the
> time of its submission: crime first, punishment second.
First, jobs should declare that; second, I'm not talking about submission, I'm talking about limiting (not killing during runtime).

> > > The real reason why I created that patch is that our Grid cluster was drowned
> > > with jobs that ate 15-25 GB of virtual memory and, given that we mostly
> > > have 8-slot machines, the OOM killer was pretty busy on them; so busy that some
> > > kernel threads weren't woken up for 3-4 minutes.
> >
> > Well, why don't you limit the amount of memory in the first place?
>
> Via what means?

Cgroups, ulimit, virtual machines... Millions of choices.

> > > But when I tried to use resources_max/resources_default, Maui started to
> > > underfill our slots, because resources_max/resources_default are transformed into
> > > job requirements and not only enforced on the MOM side. So, the codename
> > > "transparent" was born ;))
> >
> > Well, that's definitely a Maui configuration problem and has pretty much
> > nothing to do with Torque.
>
> I am sorry, but you're plain wrong.
> Maui does what it should do: it evaluates job requirements and selects slots
> based on them.
> The problem is that I just don't want _administrator-set_ resource limits to be
> treated as the job requirements.

And that's a configuration issue in Maui and not Torque.

> If a job additionally specifies requirements -- let it be, the scheduler should
> obey the requests (if they aren't higher than resources_max).

Again, nothing to do with Torque, pure Maui configuration.

> > Not a very good idea to fix a Maui configuration problem with a patch for Torque :-D
>
> I am sorry for being a bit harsh, but given that you haven't seen my Torque and
> Maui configuration, you can't judge if there are some configuration problems in
> it.
>
> Once again: if the job itself specifies the limit -- let it be, Maui should respect
> it and choose the slot that fulfills the requirement. But _not every job_ will
> want, say, 4gb of vmem, that's the problem. Some of them will want only 1gb, so
> forcing the scheduler to find slots with 4gb of free vmem for such a job --
> that's ineffective, because I want all our job slots to be populated with
> tasks. I know that the average memory consumption for a job is 2gb, so I am
> setting the 4gb cap to filter outrageous jobs.

Why would any scheduler allocate 4GB for a job that requests 1GB?

> If you have an idea how to do it with the Torque/Maui combo without using my patch
> and fulfilling the requirement of a job being able to specify its own resource
> requirements -- I am all ears.

You should ask that on the Maui mailing list, not here.
Simon, let us stop this conversation: you don't understand what I want and I am not going to explain it once again (what you're missing is that our jobs aren't requesting any resources from Torque, because they just don't know how much memory they are going to use). I can use ulimit and cgroups, but it is simpler to use central configuration in Torque than to configure hundreds of nodes.
> Simon, let us stop this conversation: you don't understand what I want and I am
> not going to explain it once again (what you're missing is that our jobs aren't
> requesting any resources from Torque, because they just don't know how much
> memory they are going to use).

Yes, I understand that your jobs do not have resource requests on them. You can even read my reaction towards this in the previous comment: "One of the problems, but OK, this one might not be solvable."

What you seem to totally ignore is the correct way to approach this. You can configure the server to add resource limitation to jobs (which will then be enforced using ulimit) and YOU need to configure YOUR scheduler to understand this correctly. In your case it means "ignore vmem resource".

Once again this is a scheduler configuration problem. Your scheduler is not ignoring what you want it to ignore, but instead of configuring the scheduler you wrote a patch for Torque.

> I can use ulimit and cgroups, but it is simpler to use central configuration in
> Torque than to configure hundreds of nodes.

I am talking about central configuration on the server!

Ulimit is already supported in Torque. If you implemented cgroup support in Torque instead of this patch, it would be awesome, because cgroups give you much more control over the limits.
(In reply to comment #9)
> Simon, let us stop this conversation: you don't understand what I want and I am
> not going to explain it once again (what you're missing is that our jobs aren't
> requesting any resources from Torque, because they just don't know how much
> memory they are going to use).
>
> I can use ulimit and cgroups, but it is simpler to use central configuration in
> Torque than to configure hundreds of nodes.

As a side note: A patch that would add a server flag or node configuration option modifying the logic from "limit using ulimit" to "kill the job" would be totally OK (although I think this approach is just weird). But it has to work with the normal resources.
Once again, I am asking you to stop spamming -- you don't understand the point of the patch. If you have something to explain to me -- do it privately, don't make others read this stupid conversation in which both you and I are just repeating our statements.

You seem to have a formed opinion on how to schedule jobs and how to allocate resources. What you can't understand is that not all use cases can be described by your view of things -- there are many different situations and many different scheduling and resource usage policies, especially in mixed environments where multi-organizational jobs should be processed.

Besides, you seem not to be familiar with the patch you're trying to criticize. So, what's the point?

You think that the stuff I am trying to implement is doable without this patch? OK, prove it, don't say "You should ask that on the Maui mailing list, not here". Or, if you can't prove it, don't throw around assertions like "Not a very good idea to fix a Maui configuration problem with a patch for Torque", especially when you're saying that you're not an expert in Maui/Moab, http://www.clusterresources.com/bugzilla/show_bug.cgi?id=67#c22

(In reply to comment #10)
> What you seem to totally ignore is the correct way to approach this. You can
> configure the server to add resource limitation to jobs (which will then be
> enforced using ulimit) and YOU need to configure YOUR scheduler to understand
> this correctly. In your case it means "ignore vmem resource".

The scheduler shouldn't ignore the 'vmem' resource if the user asked for it -- that would be plain wrong. And your "correct" approach won't solve this problem: I need the scheduler to consider 'vmem' and other attributes when they are explicitly requested, but I also need to limit the total 'vmem' consumption in any case.
It isn't doable if one is using resources_max/resources_default, because there is no way to understand who set the requirement -- the user or 'resources_default': it will be just a requirement for the scheduler. And I can't turn it off, as explained at the beginning of this paragraph.

> Once again this is a scheduler configuration problem. Your scheduler is not
> ignoring what you want it to ignore, but instead of configuring the scheduler
> you wrote a patch for Torque.

Once again, you just don't understand the nature of the patch.

> I am talking about central configuration on the server!
>
> Ulimit is already supported in Torque.

You seem to be unaware of the difference between 'vmem' and 'pvmem'. There is no sane way to limit 'vmem' using ulimit. Please, go and read the sources -- mom_set_limits(), mom_over_limit() and mom_do_poll() within the arch-dependent mom_mach.c should be very enlightening. Or, at least, glance over http://www.clusterresources.com/torquedocs/2.1jobsubmission.shtml

And this "ulimit implementation" has side effects on scheduling, as was explained a number of times.

> If you implemented cgroup support in Torque instead of this patch, it would be awesome, because cgroups give you much more control over the limits.

Please, stop telling me what I should do, especially in such a way. Maybe I will implement cgroup support if I need it, but I certainly won't do it if someone tells me "Hey, you! Instead of doing your patches, implement this brilliant idea". Communication is a skill and you seem to be lacking in that area.
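The 'vmem' versus 'pvmem' point can be illustrated as follows, on the understanding that pvmem is a cap on any single process while vmem is a cap on the sum over all processes of the job, and that a ulimit-style limit applies per process. The sketch does not reproduce the logic of mom_set_limits(), mom_over_limit() or mom_do_poll(); the function names and the sample figures are made up for the example.

```python
# Illustration only: hypothetical helpers, not the MOM's implementation.
MB = 1024**2
GB = 1024**3


def over_pvmem(proc_vmem: dict, pvmem_limit: int) -> list:
    """PIDs that individually exceed the per-process limit (what a ulimit can catch)."""
    return sorted(pid for pid, used in proc_vmem.items() if used > pvmem_limit)


def over_vmem(proc_vmem: dict, vmem_limit: int) -> bool:
    """True if the job as a whole exceeds the job-wide limit (needs summing over processes)."""
    return sum(proc_vmem.values()) > vmem_limit


if __name__ == "__main__":
    # Four processes of one job, 1.5 GB of virtual memory each (made-up numbers).
    usage = {101: 1536 * MB, 102: 1536 * MB, 103: 1536 * MB, 104: 1536 * MB}
    print(over_pvmem(usage, 2 * GB))  # no single process is over 2 GB
    print(over_vmem(usage, 4 * GB))   # but the job total of 6 GB is over 4 GB
```

In this example every process stays under a 2 GB per-process cap, so a per-process limit never fires, while the job as a whole is well over a 4 GB job-wide cap; catching that requires periodically summing usage across the job's processes.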
Oh, yes, personal attacks, I haven't seen those in some time. :-)

> The scheduler shouldn't ignore the 'vmem' resource if the user asked for it -- that
> would be plain wrong. And your "correct" approach won't solve this problem: I
> need the scheduler to consider 'vmem' and other attributes when they are
> explicitly requested, but I also need to limit the total 'vmem' consumption in
> any case. It isn't doable if one is using resources_max/resources_default,
> because there is no way to understand who set the requirement -- the user or
> 'resources_default': it will be just a requirement for the scheduler. And
> I can't turn it off, as explained at the beginning of this paragraph.

Then you want to configure the scheduler to check just the max limit and not calculate the used amount. It's so damn simple.

OK, I will end this here as requested. I have already expressed all my opinions anyway.