[torquedev] ulimit/setrlimit doesn't enforce RLIMIT_DATA on Linux

"Mgr. Šimon Tóth" SimonT at mail.muni.cz
Tue Oct 19 05:47:10 MDT 2010


> First of all, I will hijack the thread and offer my public apologies to
> Simon: in http://www.clusterresources.com/bugzilla/show_bug.cgi?id=86
> I was excessively rude for no particular reason (well, at the time
> of writing I thought that there was a reason, but it turned out to be
> untrue).  Simon, I was not right at all in my personal judgements,
> sorry for that!

No problem.

>>> 1. mom_over_limit()  in src/resmom/linux/mom_mach.c does NOT check
>>> "mem", only vmem and pvmem. The patch that Anton Starikov attached to
>>> the old thread did not make it into the source tree.
> 
> That is a partially correct assertion: mom_set_limits() takes care
> of "mem" for single-node jobs: it invokes setrlimit() for
> RLIMIT_DATA, RLIMIT_RSS, RLIMIT_STACK and RLIMIT_AS in the following
> code block
> {{{
>     else if ((!strcmp(pname, "mem") && (pjob->ji_numnodes == 1)) ||
>              !strcmp(pname, "pmem"))
>       {
>       if (ignmem == FALSE)
> }}}
> So, no single process in a single-node job will consume more than
> "mem" or "pmem" of memory, even via mmap -- RLIMIT_AS is in action.

You missed the OR in the condition. The "p" prefix means per-process,
so that branch is applied regardless of the node count.

You also missed the fact that the RLIMIT_AS part is actually inside an
ifdef (which won't be invoked); see the annotated outline below.
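
To make that concrete, here is an annotated outline of how that block
is structured. This is a paraphrase, not a verbatim copy of
mom_set_limits(), and the guard macro name below is a placeholder for
whatever symbol the tree actually uses:

{{{
    /* Outline only, not verbatim Torque code. */
    else if ((!strcmp(pname, "mem") && (pjob->ji_numnodes == 1)) ||
             !strcmp(pname, "pmem"))  /* "pmem" always takes this branch */
      {
      if (ignmem == FALSE)
        {
        /* RLIMIT_DATA, RLIMIT_RSS and RLIMIT_STACK are set here ... */
#ifdef SOME_GUARD_MACRO  /* placeholder name; not defined in a normal build */
        /* ... and only under this guard is RLIMIT_AS set as well. */
#endif
        }
      }
}}}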

Now there is another problem (although it might be handled somewhere
else): ji_numnodes is the number of nodes, not processes. Plus, your
job can still fork(), and I'm pretty sure Torque won't handle that
correctly. Because the limits are per-process, not per process group,
the job can still escape the limitation (sketched below).
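
Here is a minimal standalone sketch of the fork() problem (ordinary C,
not Torque code): every child respects its own per-process RLIMIT_AS,
yet the job as a whole uses several times the intended cap.

{{{
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/resource.h>
#include <sys/wait.h>

int main(void)
{
    /* Per-process cap of 256 MB on the address space. */
    struct rlimit rl = { 256UL << 20, 256UL << 20 };
    setrlimit(RLIMIT_AS, &rl);

    /* Four children, each allocating ~200 MB (under its own limit),
     * but together roughly 800 MB, far over the intended 256 MB. */
    for (int i = 0; i < 4; i++)
        {
        if (fork() == 0)
            {
            char *p = malloc(200UL << 20);
            if (p != NULL)
                memset(p, 1, 200UL << 20);  /* touch the pages */
            sleep(10);                      /* hold the memory for a while */
            _exit(0);
            }
        }
    while (wait(NULL) > 0)
        ;
    return 0;
}
}}}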

> But this won't guarantee the semantics of "mem" as being the total
> limit on the memory usage summed over the whole job.  So, the patch
> at http://www.supercluster.org/pipermail/torqueusers/attachments/20100318/315367b2/attachment.obj
> should really be considered.

Yes, that would make sense. But if you check get_proc_stat and trace
back through the calling functions, you will see that, again, the job
can escape the limit simply by forking (as in the sketch above). So it
helps a little, but not much.

Again it might be handled elsewhere.

> And here comes the other thing: since, as I understand it, there is no
> particular order for "mem" and "pmem" in ji_wattr[JOB_ATR_resource], if
> both limits are specified, then it is up to the implementation which
> limit wins.  Obviously, it is "pmem" that should be the limiter
> for the ulimit case.  On the other hand, there is little sense in
> setting "pmem" > "mem", so the smaller value should really win here.

Actually, there is a distinction: pmem is per-process mem, while mem is
per-job mem. The same holds for vmem and pvmem.

>>> 2. When setting mem, pmem, vmem, pvmem in the Torque script, only
>>> "pmem" actually gets translated into an rlimit ("data"). The other
>>> three resources (mem, vmem, and pvmem) are ignored. If I understand
>>> correctly, that's correct behavior for mem and vmem, which are summed
>>> limits over all processes in the job. But I would have thought
>>> setting pvmem would have set the address space (aka virtual memory)
>>> limit.
> 
> min("pvmem", "vmem") is placed to a vmem_limit variable (inside
> mom_set_limits()) that is subsequently used for setting RLIMIT_AS,
> so there seems to be no such problem.  Though I can miss something
> important here.
> 
>>> 3. While torque does cancel a job if it runs over its walltime
>>> request, torque does nothing about jobs which run over their mem
>>> request. It leaves that to the scheduler to cancel.
> 
> It should also be cured by Anton's patch.  Jobs that go over
> their "vmem" request are correctly killed by Torque.  I don't know if
> "pmem" should be considered, because it will be enforced via ulimit.

Well, it is only sort of enforced: malloc can still go over the limit
(see below).

> Also, I don't quite understand whether one needs the check for "pvmem"
> inside mom_over_limit(), because it is also enforced using ulimit.
> But if it is really needed, then we will need a "pmem" check inside
> mom_over_limit() as well.
> 
>> I can confirm this issue (specifically, it seems to be a problem of
>> mmap not being covered by RLIMIT_DATA).
> 
> Can you describe your job's layout and the requirements published to
> Torque, either explicitly or via queue/server limits?  RLIMIT_AS seems
> to be in place, but it may be missing at some points.

RLIMIT_AS is the equivalent of vmem, not mem. Mem is enforced using
RLIMIT_DATA, and on Linux RLIMIT_DATA doesn't cover mmap, which malloc
uses, making malloc effectively immune to RLIMIT_DATA (demonstrated
below).
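
A standalone sketch (plain C, not Torque code) that shows the effect:
set RLIMIT_DATA low and then malloc far past it. glibc serves the large
request with mmap(), so on the kernels discussed here the allocation
succeeds anyway.

{{{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

int main(void)
{
    /* Cap the data segment at 16 MB. */
    struct rlimit rl = { 16UL << 20, 16UL << 20 };
    if (setrlimit(RLIMIT_DATA, &rl) != 0)
        {
        perror("setrlimit");
        return 1;
        }

    /* Ask for 256 MB.  glibc hands out allocations this large via
     * mmap(), which RLIMIT_DATA does not cover here, so it succeeds. */
    size_t sz = 256UL << 20;
    char *p = malloc(sz);
    if (p == NULL)
        printf("malloc failed: the limit was enforced\n");
    else
        {
        memset(p, 1, sz);  /* touch the pages so they are really allocated */
        printf("got %zu MB despite a 16 MB RLIMIT_DATA\n", sz >> 20);
        }
    return 0;
}
}}}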

>> I have been looking at cgroups for some time, because they allow much
>> more control over system limits. But I think that we need to talk
>> about the semantics a little bit, and also about what we want to
>> enforce.
> 
> I think that we should decouple the system enforcers (like setrlimit
> and cgroups) from the software enforcers (like mom_over_limit), add to
> the individual resource values ("pmem", "vmem", etc.) a specification
> of whether they can be enforced by system means (and by which of them,
> say with flags like USE_SETRLIMIT, USE_CGROUPS), and streamline the
> general enforcement logic, because right now it looks like code that
> needs to be refactored, and it is better to use data-driven logic here
> as much as possible instead of coding all cases by hand.

Yes, well. You stumbled upon the not very deeply buried dead body of
Torque semantics. For years, the process semantics were used as a
stand-in for a CPU count, which caused a lot of problems (this is just
one manifestation).

I have already tried several times to get the core developers to
cooperate on a clear document describing the precise semantics (in
plain Torque) of the different resources: when each is considered on
the server and when on the nodes, and which resources are per-process,
per-node and per-job.

If you check my patch, which is sitting in Bugzilla, it actually leaves
all of these options open (everything is set using a flag in the
resource definition), because the semantics are totally fuzzy right
now. A rough sketch of that idea follows.
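
For illustration only, a data-driven layout of that idea might look
something like the table below; the names, flags and scopes are made up
for the example, not the ones used in the actual patch.

{{{
/* Illustrative sketch only; identifiers are hypothetical. */
#define ENFORCE_SETRLIMIT 0x1  /* kernel-enforced via setrlimit()          */
#define ENFORCE_POLL      0x2  /* software-enforced, e.g. mom_over_limit() */
#define ENFORCE_CGROUPS   0x4  /* kernel-enforced via cgroups              */

enum resc_scope { SCOPE_PROCESS, SCOPE_NODE, SCOPE_JOB };

struct resc_enforce
    {
    const char     *name;   /* resource name as requested by the job */
    enum resc_scope scope;  /* per-process, per-node or per-job      */
    int             how;    /* bitmask of ENFORCE_* methods          */
    };

static const struct resc_enforce enforce_table[] =
    {
    { "pmem",  SCOPE_PROCESS, ENFORCE_SETRLIMIT },
    { "pvmem", SCOPE_PROCESS, ENFORCE_SETRLIMIT },
    { "mem",   SCOPE_JOB,     ENFORCE_POLL | ENFORCE_CGROUPS },
    { "vmem",  SCOPE_JOB,     ENFORCE_POLL },
    };
}}}

The enforcement code would then walk such a table instead of
hard-coding each resource name.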

There is another layer of problems caused by external schedulers (Maui,
Moab and pretty much any middleware). These usually ignore Torque
semantics and enforce their own (but in Torque we shouldn't really
worry about that; we just need to make it possible).

-- 
Mgr. Šimon Tóth


