[torquedev] ulimit/setrlimit doesn't enforce RLIMIT_DATA on Linux

"Mgr. Šimon Tóth" SimonT at mail.muni.cz
Tue Oct 19 08:00:52 MDT 2010


>>> But this won't guarantee the semantics of "mem" as being the total
>>> limit for the summed memory usage over the whole job.  So, the patch
>>> at http://www.supercluster.org/pipermail/torqueusers/attachments/20100318/315367b2/attachment.obj
>>> should really be considered.
>>
>> Yes, that would make sense. But if you check get_proc_stat() and trace
>> back through the higher-level functions, you will see that, again, the
>> job should be able to exceed the limit simply by forking. So it helps a
>> little, but not really.
>>
>> Again, it might be handled elsewhere.
> 
> The mentioned patch modifies mom_over_limit, so it will enforce that
> the summed memory usage won't go above the "mem" limit.

Yes, but that patch uses resi_sum(), which in turn uses get_proc_stat().
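
For illustration, the shape of such a summed check is roughly this (a
sketch only, not the actual resi_sum()/get_proc_stat() code; I am
assuming the job's pid list comes from the MOM's task tracking):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>

/* Read the resident set size of one process from /proc/<pid>/stat
 * (field 24, "rss", counted in pages). */
static long proc_rss_bytes(pid_t pid)
{
    char path[64], buf[1024];
    long rss_pages;
    char *p;
    FILE *f;
    int i;

    snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
    if ((f = fopen(path, "r")) == NULL)
        return 0;                          /* process already exited */
    if (fgets(buf, sizeof(buf), f) == NULL)
        buf[0] = '\0';
    fclose(f);

    /* comm (field 2) may contain spaces, so start after the last ')',
     * then advance to the space that precedes field 24 (rss). */
    if ((p = strrchr(buf, ')')) == NULL)
        return 0;
    for (i = 0; i < 22 && p != NULL; i++)
        p = strchr(p + 1, ' ');
    if (p == NULL || sscanf(p, "%ld", &rss_pages) != 1)
        return 0;

    return rss_pages * sysconf(_SC_PAGESIZE);
}

/* Sum over all processes of the job: with only per-process rlimits, a
 * job escapes any cap simply by forking, hence the summed check. */
static long job_rss_bytes(const pid_t *pids, int npids)
{
    long total = 0;
    int i;

    for (i = 0; i < npids; i++)
        total += proc_rss_bytes(pids[i]);
    return total;
}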

>>> And here comes the other thing: since, as I understand it, there is no
>>> particular order for "mem" and "pmem" in ji_wattr[JOB_ATR_resource], if
>>> both limits are specified, then it is up to the implementation which
>>> limit wins.  Obviously, it is "pmem" that should be the limiter for the
>>> ulimit case.  On the other hand, there is little sense in setting
>>> "pmem" > "mem", so the smaller value should really win here.
>>
>> Actually, there is a distinction: pmem is per-process memory, mem is
>> per-job memory. The same holds for vmem and pvmem.
> 
> Sure.  I just wanted to say that if there is no fixed order of "mem"
> and "pmem" in the job resources, then which value actually gets applied
> via setrlimit is undefined.  In reality, the limit should be set to
> min("mem", "pmem").

The thing is that the order doesn't make sense. Mem is a per-job limit;
pmem is system-enforced per process. Effectively, if mem is not set but
pmem is, then mem = pmem*ppn. If mem is set but pmem is not, then
obviously pmem = mem.

If both are set, mem is still the limit per job and pmem still the limit
per process.

For example: mem=4G, pmem=2G, ppn=4. You can't have 4 processes each
with 1.5G of memory (because of the mem limit), and you can't have one
process with 3G and three with 0.1G (because of the pmem limit).
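
As a sketch (made-up names, not Torque source), the per-process value
that would end up in setrlimit() follows exactly that rule, while the
job-wide mem check stays a separate software check over the sum:

#include <sys/resource.h>

/* Hypothetical helper: sizes are in bytes, 0 means "not requested".
 * pmem is the system-enforced per-process cap; without pmem, a single
 * process still must not exceed the whole-job mem. */
static rlim_t per_process_mem_limit(rlim_t mem, rlim_t pmem)
{
    if (pmem != 0)
        return pmem;            /* pmem is the per-process cap */
    if (mem != 0)
        return mem;             /* one process can use at most mem */
    return RLIM_INFINITY;       /* neither limit requested */
}

static int apply_mem_limit(rlim_t mem, rlim_t pmem)
{
    struct rlimit rl;

    rl.rlim_cur = rl.rlim_max = per_process_mem_limit(mem, pmem);
    return setrlimit(RLIMIT_DATA, &rl);  /* with the caveats below */
}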

>>>>> 3. While Torque does cancel a job if it runs over its walltime
>>>>> request, it does nothing about jobs which run over their mem
>>>>> request. It leaves that to the scheduler to cancel.
>>>
>>> It should also be cured by Anton's patch.  Jobs that go over their
>>> "vmem" request are correctly killed by Torque.  I don't know if "pmem"
>>> should be considered, because it will be enforced via ulimit.
>>
>> Well, kind of enforced: malloc can go over the limit.
> 
> Yes, modulo mmap stuff that is governed by the RLIMIT_AS that is
> set by the __GATECH ;))

Yes, but it shouldn't be: RLIMIT_AS is vmem, not mem.
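
To make the subject line concrete: at least on the kernels discussed
here, RLIMIT_DATA only constrains brk()/sbrk(), while glibc's malloc
falls back to mmap for large requests, so allocations sail right past
the limit. A minimal standalone demonstration (not Torque code):

#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void)
{
    /* Cap the data segment at 16 MiB. */
    struct rlimit rl = { 16UL << 20, 16UL << 20 };

    if (setrlimit(RLIMIT_DATA, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }

    /* Ask for 64 MiB via mmap: the kernel checks RLIMIT_DATA only on
     * brk(), so this succeeds despite the 16 MiB limit. */
    void *p = mmap(NULL, 64UL << 20, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    printf("64 MiB mmap under a 16 MiB RLIMIT_DATA: %s\n",
           p == MAP_FAILED ? "failed" : "succeeded");
    return 0;
}

Swap RLIMIT_DATA for RLIMIT_AS and the mmap fails, which is exactly the
vmem-not-mem distinction above.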

>>> I think that we should decouple the system enforcers (like setrlimit
>>> and cgroups) from the software enforcers (like mom_over_limit), add
>>> specifications to the individual resource values ("pmem", "vmem", etc.)
>>> saying whether these resources can be enforced via system means (and by
>>> which of them, say flags like USE_SETRLIMIT, USE_CGROUPS), and
>>> streamline the general enforcement logic, because right now this looks
>>> like code that needs to be refactored, and it is better to use
>>> data-driven logic here as much as possible instead of coding all the
>>> cases by hand.
>>
>> Yes, well. You have stumbled upon the not-very-deeply-buried dead body
>> of Torque semantics. For years, the process semantic was used as a
>> replacement for a CPU count, which caused a lot of problems (this is
>> just one manifestation).
>>
>> I have already tried several times to get the core developers to
>> cooperate on writing a clear document describing the precise semantics
>> (in plain Torque) of the different resources: when they are considered
>> on the server and when on the nodes; what is per-process, what is
>> per-node, and what is a per-job resource.
> 
> What do you mean by "considered on the server"?  Did you mean
> "considered in the job allocation process" or something else?

If a scheduler specifies a resource request in the nodespec sent with
the run request, these resources should definitely be considered (the
server should check whether they are available), and system-enforceable
resources should be enforced on the nodes (via system limits).

>> If you check my patch, still hanging in Bugzilla, it actually leaves
>> all these ways open (everything is set using a flag in the resource
>> definition), because the semantics are totally fuzzy right now.
>>
>> There is another level of problems caused by external schedulers (Maui,
>> Moab and pretty much any middleware). These usually ignore Torque
>> semantics and enforce their own (but in Torque we shouldn't really
>> worry about that, just make it possible).
> 
> Yes, external schedulers have their own view.  But should Torque have
> any built-in semantics at all?  Can't we just leave the definition to
> the Torque administrator via the flags (as in your implementation) or
> some other means?

Because it already does. Getting rid of the built-in semantics and
making everything configurable (at build time/at run time/per request)
is itself a definition of semantics. The problem is that there are
colliding semantics in Torque right now.

Plus, if you just execute qrun, you should still get consistent results.

> People will need to run external schedulers with Torque, so it should
> leave open the possibility of adapting its semantics to the scheduler's
> (and vice versa, in an ideal world).

External schedulers should just request job runs without any resource
semantics at all: simply request a run with the exec host specified.
From what I have been told, most do.

-- 
Mgr. Šimon Tóth
