[torquedev] ulimit/setrlimit doesn't enforce RLIMIT_DATA on Linux

Eygene Ryabinkin rea+maui at grid.kiae.ru
Tue Oct 19 06:42:36 MDT 2010

Tue, Oct 19, 2010 at 01:47:10PM +0200, "Mgr. Šimon Tóth" wrote:
> > {{{
> >     else if ((!strcmp(pname, "mem") && (pjob->ji_numnodes == 1)) ||
> >              !strcmp(pname, "pmem"))
> >       {
> >       if (ignmem == FALSE)
> > }}}
> > So, no single process in the single-node job will consume more than
> > "mem" or "pmem" of memory, even for mmap -- RLIMIT_AS is in action.
> You missed the OR statement in the condition. The "p" prefix means
> per-process, so that branch is always applied.

Yes, but I had really missed that the RLIMIT_AS branch is #ifdef'ed,
because I am running Torque with __GATECH defined ;))

> Now there is another problem (although it might be handled somewhere
> else). ji_numnodes is the number of nodes, not processes. Plus, your
> job can still fork(), and I'm pretty sure that this won't be handled
> correctly by Torque. Because the limits are per-process, not
> per-process-group, the job can still escape the limitation.

Yes, that's why I am talking about Anton's "mem" patch.

> > But this won't guarantee the semantics of "mem" as being the total
> > limit for the summed memory usages over the whole job.  So, the patch
> > at http://www.supercluster.org/pipermail/torqueusers/attachments/20100318/315367b2/attachment.obj
> > should be really considered.
> Yes, that would make sense. But if you check get_proc_stat and trace
> back to the higher functions, you will see that, again, the job should
> be able to exceed the limit simply by forking. So it helps a little,
> but not really.
> Again, it might be handled elsewhere.

The mentioned patch modifies mom_over_limit, so it will enforce that
the summed memory usage does not go above the "mem" limit.

> > And here comes the other thing: since, as I understand, there is no
> > particular order for "mem" and "pmem" in ji_wattr[JOB_ATR_resource],
> > then if both limits are specified, it is up to the implementation
> > which limit wins.  Obviously, it is "pmem" that should be the limiter
> > for the ulimit case.  On the other hand, there is little sense in
> > setting "pmem" > "mem", so the smaller value should really win here.
> Actually, there is a distinction: pmem is the per-process mem, mem is
> the job mem. The same for vmem and pvmem.

Sure.  I just wanted to say that if there is no fixed order of "mem"
and "pmem" in the job resources, then which value actually ends up in
setrlimit is undefined.  In reality, the limit should be set to
min("mem", "pmem").

> >>> 3. While torque does cancel a job if it runs over its walltime
> >>> request, torque does nothing about jobs which run over their mem
> >>> request. It leaves that to the scheduler to cancel.
> > 
> > It should also be cured by Anton's patch.  Jobs that go over their
> > "vmem" request are correctly killed by Torque.  I don't know if
> > "pmem" should be considered, because it will be enforced via ulimit.
> Well, kind of enforced. malloc goes over the limit.

Yes, modulo the mmap stuff, which is governed by the RLIMIT_AS limit
set under __GATECH ;))

> > I think that we should decouple the system enforcers (like setrlimit
> > and cgroups) from the software enforcers (like mom_over_limit), add
> > specifications to the individual resource values ("pmem", "vmem",
> > etc.) saying whether these resources can be enforced via system means
> > (and by which of them, say via flags like USE_SETRLIMIT and
> > USE_CGROUPS), and streamline the general enforcement logic.  Right
> > now the code looks like it needs refactoring, and it is better to
> > use data-driven logic here as much as possible instead of coding
> > all the cases by hand.
> Yes, well. You stumbled upon the, not very deeply buried, dead body of
> the Torque semantics. For years, the process semantics were used as a
> CPU-count replacement, which caused a lot of problems (this is just
> one manifestation).
> I have already tried several times to get the core developers to
> cooperate on a clear document describing the precise semantics (in
> plain Torque) of the different resources: when they are considered on
> the server and when on the nodes; what is per-process, what is
> per-node, and what is a per-job resource.

What do you mean by "considered on the server"?  Do you mean "considered
in the job allocation process", or something else?

> If you check my patch, sitting in the bugzilla, it actually leaves all
> these ways open (everything is set using a flag in the resource
> definition), because the semantics are totally fuzzy right now.
> There is another level of problems caused by external schedulers (Maui,
> Moab, and pretty much any middleware). These usually ignore the Torque
> semantics and enforce their own (but in Torque we shouldn't really
> worry about that, just make it possible).

Yes, external schedulers have their own view.  But should Torque have
any built-in semantics at all?  Can't we just leave the definition to
the Torque administrator, using the flags (as in your implementation)
or some other means?

People will need to run external schedulers with Torque, so it should
leave open the possibility of adapting its semantics to the scheduler's
(and vice versa, in an ideal world).

Eygene Ryabinkin, Russian Research Centre "Kurchatov Institute"
