[torquedev] ulimit/setrlimit doesn't enforce RLIMIT_DATA on Linux

Eygene Ryabinkin rea+maui at grid.kiae.ru
Tue Oct 19 02:51:36 MDT 2010


First of all, I will hijack the thread and offer my public apologies to
Simon: in http://www.clusterresources.com/bugzilla/show_bug.cgi?id=86
I was excessively rude for no particular reason (well, at the time of
writing I thought there was a reason, but it turned out to be untrue).
Simon, I was not right at all in my personal judgements, sorry for that!

Mon, Oct 18, 2010 at 06:00:16PM +0200, "Mgr. Šimon Tóth" wrote:
> As described in:
> http://www.supercluster.org/pipermail/torqueusers/2010-October/011540.html

The original poster of the message you mention seems to be incorrect on
some of his points.  More specifically (I'll talk about the Linux
implementation only, because that is what is mostly used nowadays):

>> 1. mom_over_limit()  in src/resmom/linux/mom_mach.c does NOT check
>> "mem", only vmem and pvmem. The patch that Anton Starikov attached to
>> the old thread did not make it into the source tree.

That assertion is only partially correct: mom_set_limits() does take
care of "mem" for single-node jobs: it invokes setrlimit() for
RLIMIT_DATA, RLIMIT_RSS, RLIMIT_STACK and RLIMIT_AS in the following
code block
{{{
    else if ((!strcmp(pname, "mem") && (pjob->ji_numnodes == 1)) ||
             !strcmp(pname, "pmem"))
      {
      if (ignmem == FALSE)
}}}
So, no single process in a single-node job will consume more than
"mem" or "pmem" of memory, even via mmap -- RLIMIT_AS is in action.
But this won't guarantee the semantics of "mem" as the total limit on
the summed memory usage over the whole job.  So, the patch at
http://www.supercluster.org/pipermail/torqueusers/attachments/20100318/315367b2/attachment.obj
should really be considered.
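
For reference, here is roughly what that branch boils down to once a
per-process byte value has been derived from "mem"/"pmem" (a simplified
sketch of mine, not the actual Torque code):
{{{
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: apply one byte value to the four limits that
   mom_set_limits() touches for "mem"/"pmem" on Linux. */
static int apply_mem_limit(rlim_t bytes)
  {
  int resources[] = { RLIMIT_DATA, RLIMIT_RSS, RLIMIT_STACK, RLIMIT_AS };
  struct rlimit rl;
  unsigned int i;

  rl.rlim_cur = bytes;
  rl.rlim_max = bytes;

  for (i = 0; i < sizeof(resources) / sizeof(resources[0]); i++)
    {
    if (setrlimit(resources[i], &rl) != 0)
      {
      perror("setrlimit");
      return -1;
      }
    }

  return 0;
  }
}}}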

And here comes the other thing: since, as I understand it, there is no
particular order for "mem" and "pmem" in ji_wattr[JOB_ATR_resource], if
both limits are specified it is up to the implementation which limit
will win.  Obviously, it is "pmem" that should be the limiter for the
ulimit case.  On the other hand, there is little sense in setting
"pmem" > "mem", so the smaller value should really win here.
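
A hedged sketch of that idea (an illustration only, not the current
code): collect both values first and enforce the smaller one at the
end, instead of calling setrlimit() in whatever order the resources
happen to be listed:
{{{
#include <sys/resource.h>

/* RLIM_INFINITY stands for "not requested".  The caller feeds the
   returned value into setrlimit() for the per-process memory limits. */
static rlim_t pick_process_mem_limit(rlim_t mem_bytes, rlim_t pmem_bytes)
  {
  rlim_t limit = RLIM_INFINITY;

  if (mem_bytes != RLIM_INFINITY)
    limit = mem_bytes;

  if (pmem_bytes != RLIM_INFINITY && pmem_bytes < limit)
    limit = pmem_bytes;

  return limit;
  }
}}}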

>> 2. When setting mem, pmem, vmem, pvmem in the Torque script, only
>> "pmem" actually gets translated into an rlimit ("data"). The other
>> three resources (mem, vmem, and pvmem) are ignored. If I understand
>> correctly, that's correct behavior for mem and vmem, which are summed
>> limits over all processes in the job. But I would have thought
>> setting pvmem would have set the address space (aka virtual memory)
>> limit.

min("pvmem", "vmem") is placed to a vmem_limit variable (inside
mom_set_limits()) that is subsequently used for setting RLIMIT_AS,
so there seems to be no such problem.  Though I can miss something
important here.

>> 3. While torque does cancel a job if it runs over its walltime
>> request, torque does nothing about jobs which run over their mem
>> request. It leaves that to the scheduler to cancel.

It should also be cured by Anton's patch.  Jobs that go over their
"vmem" request are correctly killed by Torque.  I don't know if "pmem"
should be considered, because it will be enforced via ulimit.  Also, I
don't quite understand whether one needs the check for "pvmem" inside
mom_over_limit(), because it is also enforced via ulimit.  But if it is
really needed, then we will need a "pmem" check inside mom_over_limit()
as well.
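
For clarity, the job-wide "mem" check boils down to something like the
following (a rough sketch with hypothetical names, not Anton's patch
verbatim; the real mom would take the per-process numbers from its
/proc polling):
{{{
#include <stddef.h>

/* Hypothetical per-process snapshot of resident memory. */
struct proc_mem
  {
  unsigned long long resident_bytes;
  };

/* Return 1 if the sum of resident memory over all job processes
   exceeds the job-wide "mem" request, 0 otherwise. */
static int job_over_mem_limit(const struct proc_mem *procs, size_t nprocs,
                              unsigned long long mem_limit_bytes)
  {
  unsigned long long total = 0;
  size_t i;

  for (i = 0; i < nprocs; i++)
    total += procs[i].resident_bytes;

  return (total > mem_limit_bytes) ? 1 : 0;
  }
}}}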

> I can confirm this issue (specifically it seems to be a problem of mmap
> not enforcing RLIMIT_DATA).

Can you describe your job's layout and the requirements published to
Torque, either explicitly or via queue/server limits?  RLIMIT_AS seems
to be in place, but it may be missing at some points.
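
To see the mmap() behaviour in isolation, here is a small stand-alone
test (my own sketch, nothing from the Torque tree): with only
RLIMIT_DATA lowered, an anonymous mmap() still succeeds on the Linux
kernels in use today, while lowering RLIMIT_AS makes the same mapping
fail with ENOMEM.
{{{
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void)
  {
  struct rlimit rl = { 16 * 1024 * 1024, 16 * 1024 * 1024 };  /* 16 MB */
  size_t len = 64 * 1024 * 1024;                              /* 64 MB */
  void *p;

  /* Lower RLIMIT_DATA only: the anonymous mapping still succeeds. */
  setrlimit(RLIMIT_DATA, &rl);
  p = mmap(NULL, len, PROT_READ | PROT_WRITE,
           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  printf("after RLIMIT_DATA: mmap %s\n",
         (p == MAP_FAILED) ? "failed" : "succeeded");
  if (p != MAP_FAILED)
    munmap(p, len);

  /* Lower RLIMIT_AS as well: now the same mapping fails with ENOMEM. */
  setrlimit(RLIMIT_AS, &rl);
  p = mmap(NULL, len, PROT_READ | PROT_WRITE,
           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  printf("after RLIMIT_AS:   mmap %s\n",
         (p == MAP_FAILED) ? "failed" : "succeeded");

  return 0;
  }
}}}
So as long as mom_set_limits() really reaches the RLIMIT_AS call for
your job, the mmap case should be covered; if it does not, that is the
spot to look at.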

> I have been looking at cgroups for some time, because they allow much
> more control over system limits. But I think that we need to talk about
> the semantics a little bit + also what we want to enforce.

I think that we should decouple the system enforcers (like setrlimit
and cgroups) from the software enforcers (like mom_over_limit()),
annotate the individual resource values ("pmem", "vmem", etc.) with
whether they can be enforced via system means (and by which of them,
say with flags like USE_SETRLIMIT and USE_CGROUPS), and streamline the
general enforcement logic.  Right now it looks like code that needs
refactoring, and it is better to use data-driven logic here as much as
possible instead of coding all the cases by hand.
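
Something along these lines, purely as an illustration of the
data-driven direction (all names here are hypothetical):
{{{
/* Hypothetical resource-enforcement table: each limit declares which
   mechanisms may enforce it, and the generic code walks the table
   instead of special-casing every resource by hand. */
#define USE_SETRLIMIT  0x1  /* per-process, via setrlimit() */
#define USE_CGROUPS    0x2  /* job-wide, via a cgroup controller */
#define USE_MOM_POLL   0x4  /* software check in mom_over_limit() */

struct resource_enforcer
  {
  const char   *name;        /* Torque resource name */
  unsigned int  mechanisms;  /* bitwise OR of the flags above */
  };

static const struct resource_enforcer enforcers[] =
  {
  { "pmem",  USE_SETRLIMIT },
  { "pvmem", USE_SETRLIMIT },
  { "mem",   USE_CGROUPS | USE_MOM_POLL },
  { "vmem",  USE_CGROUPS | USE_MOM_POLL },
  };
}}}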
-- 
Eygene Ryabinkin, Russian Research Centre "Kurchatov Institute"

