[torqueusers] job exceed memory limit without been killed

Chris Samuel chris at csamuel.org
Sat Mar 20 06:18:09 MDT 2010


On Fri, 19 Mar 2010 12:26:05 am Anton Starikov wrote:

> Which means that PBS_MOM already registered memory usage above limit and
> even updated this information on server, but didn't react and kill the
> job.
> 
> What can be wrong? Do I miss something in the config?

I think you are misunderstanding what the mem/vmem/pmem/pvmem limits in Torque 
actually do: they apply resource limits (ulimits in the shell, RLIMITs in 
terms of kernel APIs) to the processes that are launched by pbs_mom.

The problem is that in the old days malloc() in glibc just called brk() and in 
the Linux kernel brk() obeys the RLIMIT_DATA limit which pbs_mom sets for mem 
and pmem.

But then glibc changed and now calls mmap() for allocations over a certain 
size (the M_MMAP_THRESHOLD, 128 KiB by default), and mmap() in the Linux 
kernel does not observe RLIMIT_DATA.

Perhaps the simplest fix is to translate any reference to mem or pmem into 
vmem or pvmem, as they will set the RLIMIT_AS limit, which mmap() does 
observe. Alternatively, use the Maui/Moab tricks which use the data reported 
by the node to decide whether or not to kill the job.
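
If you want to convince yourself that RLIMIT_AS really is enforced for 
mmap(), here is a minimal sketch in Python (Linux only). It forks a child, 
lowers the child's RLIMIT_AS soft limit, and then asks for an anonymous 
private mapping larger than that limit, which is roughly what glibc does for 
a big malloc(). The helper name and the sizes are made up for illustration; 
this is not what pbs_mom itself runs.

```python
import mmap
import os
import resource


def mmap_denied_under_as_limit(limit_bytes, request_bytes):
    """Return True if a child capped by RLIMIT_AS is refused a large mmap().

    Hypothetical helper for illustration only. The limit is applied in a
    forked child so the calling process keeps its own limits.
    """
    pid = os.fork()
    if pid == 0:
        status = 2  # 2 = something unexpected happened in the child
        try:
            _, hard = resource.getrlimit(resource.RLIMIT_AS)
            soft = limit_bytes
            # The soft limit may not exceed an existing finite hard limit.
            if hard != resource.RLIM_INFINITY:
                soft = min(soft, hard)
            resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
            try:
                # Anonymous private mapping, similar to what malloc() uses
                # for large requests.
                region = mmap.mmap(
                    -1, request_bytes,
                    flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                )
            except (OSError, MemoryError):
                status = 0  # kernel refused the mapping
            else:
                region.close()
                status = 1  # mapping was allowed
        finally:
            os._exit(status)
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status) == 0


if __name__ == "__main__":
    # Cap address space at 512 MiB, then ask for 1 GiB.
    denied = mmap_denied_under_as_limit(512 * 1024**2, 1024**3)
    print("mmap refused under RLIMIT_AS:", denied)
```

The same experiment with RLIMIT_DATA in place of RLIMIT_AS is the 
interesting comparison, but the result depends on the kernel you run it on, 
so I have left it out.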

For more information on the various RLIMITs see the setrlimit() manual page.

cheers!
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP