[torqueusers] Job exceeding memory limits
David.Singleton at anu.edu.au
Thu Apr 12 14:22:02 MDT 2007
This may be a case of forked processes temporarily giving the
appearance of doubling the job's memory use. Immediately after a
fork, the parent and child look identical in memory use, so the
total appears to be double what it was before the fork. But with
modern OSes no new physical pages are consumed (because of
copy-on-write), and usually the child is about to call exec() and
wipe out the virtual address maps it has. Even so, PBS will
occasionally sample the job's memory use just when it looks doubled.
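
To make the arithmetic concrete, here is a toy model in Python. It
is only a sketch and not how the MOM is written: the Proc record,
the job_mem() helper, the pids and the 4 MiB helper size are all
made up for illustration. It assumes the sampler simply adds up the
per-process address-space sizes of every process in the job.

```python
from dataclasses import dataclass

GIB = 1024 ** 3
MIB = 1024 ** 2


@dataclass
class Proc:
    """Hypothetical per-process record; not a real TORQUE structure."""
    pid: int
    vsize: int  # bytes of virtual address space


def job_mem(procs):
    """Naive sampler: add up the size of every process in the job."""
    return sum(p.vsize for p in procs)


# One large process, before it calls system().
parent = Proc(pid=100, vsize=7 * GIB)
before = job_mem([parent])

# fork(): the child starts with a copy of the parent's address map,
# so a sample taken now counts the same 7 GiB twice, even though
# copy-on-write means no new physical pages have been used.
child = Proc(pid=101, vsize=parent.vsize)
during = job_mem([parent, child])

# exec(): the child replaces its address map with a small program.
child.vsize = 4 * MIB
after = job_mem([parent, child])

print(before, during, after)
```

A sample that lands between the fork and the exec sees "during",
which is exactly twice "before"; a sample taken a moment later
sees "after", which is barely above the original figure.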
Large memory Fortran (and C and Python and ...) apps that call
system() a lot are susceptible to this problem.
Our solution was to have the MOM ignore the resource usage of
very young processes, i.e. those less than 2 seconds old.
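
The idea can be sketched roughly as follows. This is Python for
brevity and is not our actual MOM change (the MOM is written in C);
the ProcSample record, the function name and the example figures
are hypothetical. Only the 2 second threshold comes from what we
did.

```python
from dataclasses import dataclass

MIN_AGE_SECONDS = 2.0  # threshold mentioned above


@dataclass
class ProcSample:
    """Hypothetical snapshot of one process in the job."""
    pid: int
    start_time: float  # seconds, same clock as `now`
    mem_bytes: int


def job_mem_ignoring_young(samples, now, min_age=MIN_AGE_SECONDS):
    """Sum memory over the job, skipping processes younger than min_age."""
    return sum(
        s.mem_bytes for s in samples
        if now - s.start_time >= min_age
    )


now = 10_000.0
samples = [
    # Long-running parent: counted.
    ProcSample(pid=100, start_time=now - 3600.0, mem_bytes=7 * 1024 ** 3),
    # Child forked half a second ago, not yet exec'd: skipped.
    ProcSample(pid=101, start_time=now - 0.5, mem_bytes=7 * 1024 ** 3),
]
usage = job_mem_ignoring_young(samples, now)
print(usage)
```

The trade-off is that a genuinely large process goes uncounted for
its first couple of seconds, which is harmless for limit
enforcement since it is picked up at the next sample.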
But I am not sure this is your problem because I don't think
Gaussian calls fork much - not when it's crunching anyway. (It
calls exec a lot to start new links.) I could be wrong.
Steve Young wrote:
> *bump*..... anyone have any ideas how torque checks how much memory an
> application is using? If gaussian is really going outside its
> constraints of 7gb and trying to use more than double that amount I need
> to try to verify this from command line. However with the tools I know
> how to use I am not seeing this problem. So I'm a little bit stumped as
> to how torque is "seeing" this exceeded limit. Thanks,
> On Fri, 2007-04-06 at 12:56 -0400, Steve Young wrote:
>> I posted to the list about this a while back but still haven't figured
>> out what is happening here. We are using torque-2.0.0p2. This example is
>> happening on an Irix cluster but I have seen this on some of our other
>> architectures too.
>> A gaussian job is submitted requesting 8 cpu's with 7gb of RAM. In the
>> input file it also states 7gb. Currently, if we just leave out the
>> request for memory to torque the job will run as expected. However, if I
>> do request the memory to torque I end up getting the following error:
>> =>> PBS: job killed: mem 15968747520 exceeded limit 7516192768
>> So now I tried increasing the request to 16gb. Again it terminates with:
>> =>> PBS: job killed: mem 17789206528 exceeded limit 16106127360
>> So I again increase it to 17gb and now it appears to run.
>> I realize that torque is doing what it is supposed to but I don't
>> understand why it believes that the application is using that amount of
>> memory. Looking at top on the machine I only see:
>> SIZE RES
>> 7266M 523M
>> for memory of each of the 8 processes running. So how is torque
>> "thinking" that it needs more than twice as much memory for this job?
>> We really would like to be using memory requests to torque but as of yet
>> I am unable to get past this situation. Maybe Garrick could explain how
>> torque finds out the amount of memory that the application is using?
>> Then I can try it from command line to verify it? Thanks in advance for
>> any advice.