[torquedev] ulimit/setrlimit doesn't enforce RLIMIT_DATA on Linux

Eygene Ryabinkin rea+maui at grid.kiae.ru
Tue Oct 19 09:41:58 MDT 2010

Tue, Oct 19, 2010 at 04:00:52PM +0200, "Mgr. Šimon Tóth" wrote:
> >>> But this won't guarantee the semantics of "mem" as being the total
> >>> limit for the summed memory usages over the whole job.  So, the patch
> >>> at http://www.supercluster.org/pipermail/torqueusers/attachments/20100318/315367b2/attachment.obj
> >>> should be really considered.
> >>
> >> Yes, that would make sense. But if you check the get_proc_stat and trace
> >> back to higher functions, you will see that again, the job should be
> >> able to exit the limit simply by forking. So it helps a little, but not
> >> really.
> >>
> >> Again it might be handled elsewhere.
> > 
> > The mentioned patch modifies mom_over_limit, so it will enforce that
> > the summed memory usage won't go above "mem" limit.
> Yes but that patch is using resi_sum() which is using get_proc_stat().

resi_sum() does not use get_proc_stat(); it crawls over proc_array and
sums up the RSS usage based on the session identifier.  You may argue
that it is simple to escape the session.  Yes, it is, but only
undeniable labels (like cgroups) can really help here.
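For reference, the per-session accounting can be sketched in standalone C:
crawl /proc, pull the session and RSS fields out of each stat file, and sum
the RSS of processes in the job's session.  This is only a sketch of the
approach, not the actual resi_sum() (which walks mom's cached proc_array);
the function name is mine.

```c
#define _DEFAULT_SOURCE
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Sum the resident set size (in KB) of all processes in the given
 * session -- roughly what the patch's resi_sum() does.  Hypothetical
 * standalone sketch: it re-reads /proc instead of using mom's cached
 * process table. */
static long sum_rss_kb_for_session(pid_t sid)
{
    DIR *d = opendir("/proc");
    struct dirent *e;
    long total = 0;

    if (d == NULL)
        return -1;
    while ((e = readdir(d)) != NULL) {
        char path[64], buf[1024], *p;
        long session, rss;
        FILE *f;

        if (!isdigit((unsigned char)e->d_name[0]))
            continue;
        snprintf(path, sizeof(path), "/proc/%s/stat", e->d_name);
        if ((f = fopen(path, "r")) == NULL)
            continue;       /* process may have exited meanwhile */
        if (fgets(buf, sizeof(buf), f) != NULL &&
            (p = strrchr(buf, ')')) != NULL &&
            /* after "pid (comm)" come: state ppid pgrp session ... rss,
             * i.e. fields 3-24 of /proc/[pid]/stat */
            sscanf(p + 2,
                   "%*s %*s %*s %ld %*s %*s %*s %*s %*s %*s %*s %*s "
                   "%*s %*s %*s %*s %*s %*s %*s %*s %*s %ld",
                   &session, &rss) == 2 &&
            session == (long)sid)
            total += rss * (sysconf(_SC_PAGESIZE) / 1024);
        fclose(f);
    }
    closedir(d);
    return total;
}
```

Anything that calls setsid() drops out of this sum, which is exactly the
escape route mentioned above.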

> > Sure.  I just wanted to say that if there is no fixed order of "mem"
> > and "pmem" in job resources, then what will really limit the things
> > via setrlimit is not defined.  In reality, the limit should be set
> > to min("mem", "pmem").
> The thing is that the order doesn't make sense. Mem is limited per job,
> pmem is system enforced per process. Effectively if mem is not set, but
> pmem is then mem = pmem*ppn. If mem is set but pmem is not then
> obviously pmem = mem.

From the point of view of setrlimit, both "mem" and "pmem" set it.
But the last invocation of setrlimit wins (mom still runs
as root at that point), so the order plays a role here.
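A minimal illustration of the last-call-wins behaviour (the helper name and
the restore step are mine; mom would of course not restore anything):

```c
#include <sys/resource.h>

/* Apply two soft limits to the same resource in sequence and report
 * what getrlimit() sees afterwards.  Hypothetical helper for
 * illustration only; it restores the original limit before returning. */
static rlim_t set_then_get(int resource, rlim_t first, rlim_t second)
{
    struct rlimit orig, rl;

    getrlimit(resource, &orig);
    rl = orig;
    rl.rlim_cur = first;      /* say, the "mem"-derived value */
    setrlimit(resource, &rl);
    rl.rlim_cur = second;     /* say, the "pmem"-derived value */
    setrlimit(resource, &rl);
    getrlimit(resource, &rl);
    setrlimit(resource, &orig);
    return rl.rlim_cur;       /* == second, not min(first, second) */
}
```

Whichever of "mem"/"pmem" happens to be processed second silently overrides
the other.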

> If both are set, mem is still limit per job, pmem still per process.
> For example: mem=4G pmem=2G ppn=4. You cant have 4 processes each with
> 1,5G memory (because mem limit) and you cant have one process with 3G
> and three with 0,1G (because pmem limit).

Yes, and the above code will set different rlimits depending on the order:
 - "mem" comes first, "pmem" -- second: resulting rlimit will be
   2 GB per process;
 - "pmem" comes first, "mem" -- second: resulting rlimit will be
   4 GB per process.
"rlimit" here means "system-enforced resource limit".
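The min("mem", "pmem") rule mentioned earlier can be expressed as a tiny
helper (hypothetical -- the real change would live in mom's limit-setup
code; 0 is taken to mean "this resource was not requested"):

```c
#include <sys/resource.h>

/* Per-process rlimit that respects both the per-job "mem" and the
 * per-process "pmem" request: min of the two, with 0 meaning "unset".
 * Hypothetical helper, for illustration of the rule only. */
static rlim_t per_process_rlimit(rlim_t mem, rlim_t pmem)
{
    if (mem == 0)
        return pmem;
    if (pmem == 0)
        return mem;
    return (mem < pmem) ? mem : pmem;
}
```

With mem=4G, pmem=2G this yields 2G per process regardless of the order in
which the resources appear in the job.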

> > Yes, modulo mmap stuff that is governed by the RLIMIT_AS that is
> > set by the __GATECH ;))
> Yes, but it shouldn't RLIMIT_AS is vmem not mem.

It can't stop using RLIMIT_AS (well, it can, but then such a limiter
would be useless) -- on Linux, RLIMIT_DATA constrains only the
brk()-grown data segment, so users can allocate via mmap(MAP_ANON) and
escape the limit if it is not also set via RLIMIT_AS.
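A quick way to see why RLIMIT_AS is the one that actually bites: the sketch
below lowers the RLIMIT_AS soft limit in a forked child and shows that a
large anonymous mapping is then refused (the helper name is mine; a
RLIMIT_DATA-only limit would not stop this mmap on these kernels):

```c
#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* In a forked child, lower the RLIMIT_AS soft limit to as_limit and try
 * a large anonymous mapping.  Returns 1 if mmap() was refused, 0 if it
 * succeeded, negative/2 on errors.  Hypothetical helper. */
static int mmap_fails_under_as_limit(rlim_t as_limit, size_t map_size)
{
    pid_t pid = fork();
    int status;

    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct rlimit rl;
        void *p;

        getrlimit(RLIMIT_AS, &rl);
        if (as_limit < rl.rlim_cur)
            rl.rlim_cur = as_limit;   /* only ever lower the soft limit */
        if (setrlimit(RLIMIT_AS, &rl) != 0)
            _exit(2);
        p = mmap(NULL, map_size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        _exit(p == MAP_FAILED ? 1 : 0);
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With a 512 MB address-space limit, a 1 GB anonymous mapping fails with
ENOMEM; drop the setrlimit() call and the same mmap() succeeds even with
RLIMIT_DATA set.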

> > People will need to run external schedulers with Torque, so it should
> > leave the possibilities to adopt its semantics to the scheduler's one
> > (and vice-versa, for an ideal world).
> External schedulers should just request job runs without any resource
> semantics at all. Simply request a run with exec host specified. From
> what I have been told, most do.

Nope, the scheduler should consider resource usage to get the "best fit"
and "best utilization", whatever that means for the local administrator.
It just can't really rely on the batch server for this -- it is not
the business of the batch server to decide where the job should be
executed; that is the work of the scheduler (even if it lives within the
batch server).  What the batch server can do is check that the job's
constraints allow it to run on the set of nodes that were chosen
by the scheduler.  But even this should be configurable, because
the batch server is often "too smart" and prevents the scheduler from
doing the proper thing.

Think of the batch server as just a dumb job transport that can
additionally set limits on the target resources, but it shouldn't really
decide whether the job is eligible to run on a particular node -- that is
the scheduler's job.  Or, at least, such behaviour should be configurable.

Of course, the reality is a bit more complicated, since there are
routing queues that effectively restrict (in some configurations) the
set of resources on which the job will be able to run.  So, really,
people tend to distribute the scheduler's duties between the scheduler
and the batch system.

Eygene Ryabinkin, Russian Research Centre "Kurchatov Institute"