[torqueusers] Job exceeding memory limits
chemadm at hamilton.edu
Tue Apr 17 10:05:26 MDT 2007
Thanks Dave. This is why I am wondering how torque checks the OS to
verify how much memory is being used. I suspect that when the job is
first starting, a lot more resources appear to be in use, but once it's
underway it settles into expected operation. I am hoping that once I find
out how torque does it, I can do the same from the command line to see
for myself why torque thinks the job needs so much memory.
You bring up an interesting point... having the MOM ignore resource usage
for young processes. I didn't see anything on the MOM parameters page to
configure this. Would you mind elaborating on how you did that? =)
Thanks in advance,
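In case it helps anyone else poking at this: Torque's source is the authoritative answer, but as a sketch of the general approach on a Linux-style /proc (the field positions and the idea of summing per-process virtual sizes across the job's session are both assumptions on my part, not Torque's actual code):

```python
# Sketch: approximate a job's memory the way a monitor might on Linux,
# by summing the virtual size (vsize) of every process in one session.
# Field positions follow proc(5); this mirrors the idea, not Torque itself.
import os

def stat_fields(pid):
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # comm (field 2) may contain spaces; it is parenthesised, so split
    # on the last ')' and take everything after it (fields 3 onward)
    return data[data.rindex(")") + 2:].split()

def session_mem_bytes(sid):
    total = 0
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            f = stat_fields(entry)
            proc_sid = int(f[3])    # proc(5) field 6: session id
            vsize = int(f[20])      # proc(5) field 23: vsize in bytes
        except (OSError, ValueError, IndexError):
            continue                # process vanished or unreadable
        if proc_sid == sid:
            total += vsize
    return total

if __name__ == "__main__":
    print(session_mem_bytes(os.getsid(0)))
```

Running that against a job's session id from the command line should let you compare against what MOM reports.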
On Fri, 2007-04-13 at 06:22 +1000, David Singleton wrote:
> This may be an issue of forked processes temporarily giving the
> appearance of doubling the memory use. Immediately after a fork,
> parent and child look identical in memory use, so the total is
> double what it was before the fork. But on modern OSes no new
> physical pages are consumed (because of copy-on-write), and usually
> the child is about to call exec() and wipe out the virtual
> address maps it has. But PBS will occasionally sample the job's
> memory use just when it looks doubled.
> Large memory Fortran (and C and Python and ...) apps that call
> system() a lot are susceptible to this problem.
> Our solution was to have the MOM ignore the resource usage of
> very young processes, i.e. those less than 2 seconds old.
> But I am not sure this is your problem because I don't think
> Gaussian calls fork much - not when it's crunching anyway. (It
> calls exec a lot to start new links.) I could be wrong.
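The fork-doubling effect David describes can be demonstrated directly. The snippet below (Linux-only, my own illustration, not Torque code) allocates a large buffer, forks, and shows that the child reports the same VmSize as the parent, so a sampler that sums per-process sizes sees roughly double the real usage even though copy-on-write means almost no new physical pages exist:

```python
# Demo of apparent memory doubling right after fork() (Linux /proc only).
import os

def vmsize_kb():
    # Read this process's own virtual size from /proc/self/status
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmSize:"):
                return int(line.split()[1])  # value is in kB

def demo():
    ballast = bytearray(200 * 1024 * 1024)   # ~200 MB of address space
    for i in range(0, len(ballast), 4096):   # touch each page so it is real
        ballast[i] = 1
    parent_kb = vmsize_kb()
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:                             # child: report own size, exit
        os.write(w, str(vmsize_kb()).encode())
        os._exit(0)
    os.close(w)
    child_kb = int(os.read(r, 64))
    os.waitpid(pid, 0)
    return parent_kb, child_kb

if __name__ == "__main__":
    p, c = demo()
    print(f"parent {p} kB, child {c} kB, summed {p + c} kB")
```

A monitor sampling in the window between the fork() and the child's exec() would tally both and see the summed figure.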
> Steve Young wrote:
> > *bump*..... anyone have any ideas how torque checks how much memory an
> > application is using? If gaussian is really going outside its
> > constraint of 7gb and trying to use more than double that amount, I need
> > to verify this from the command line. However, with the tools I know
> > how to use I am not seeing this problem. So I'm a little bit stumped as
> > to how torque is "seeing" this exceeded limit. Thanks,
> > -Steve
> > On Fri, 2007-04-06 at 12:56 -0400, Steve Young wrote:
> >> Hello,
> >> I posted to the list about this a while back but still haven't figured
> >> out what is happening here. We are using torque-2.0.0p2. This example is
> >> happening on an Irix cluster but I have seen this on some of our other
> >> architectures too.
> >> A gaussian job is submitted requesting 8 CPUs with 7gb of RAM. The
> >> input file also states 7gb. Currently, if we simply leave the memory
> >> request out of the torque submission, the job runs as expected. However,
> >> if I do request the memory from torque I end up getting the following error:
> >> =>> PBS: job killed: mem 15968747520 exceeded limit 7516192768
> >> Terminated
> >> So now I tried increasing the request to 16gb. Again it terminates with:
> >> =>> PBS: job killed: mem 17789206528 exceeded limit 16106127360
> >> Killed
> >> So I again increase it to 17gb and now it appears to run.
> >> I realize that torque is doing what it is supposed to but I don't
> >> understand why it believes that the application is using that amount of
> >> memory. Looking at top on the machine I only see:
> >> SIZE RES
> >> 7266M 523M
> >> for memory of each of the 8 processes running. So how is torque
> >> "thinking" that it needs more than twice as much memory for this job?
> >> We really would like to be using memory requests to torque but as of yet
> >> I am unable to get past this situation. Maybe Garrick could explain how
> >> torque finds out the amount of memory that the application is using?
> >> Then I can try it from command line to verify it? Thanks in advance for
> >> any advice.
> >> -Steve
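For what it's worth, the figures in Steve's first kill message are consistent with the fork-doubling hypothesis: the sampled usage is a bit over twice the 7gb limit, which is what you'd expect if a large process was sampled just after a fork (plus some overhead). A quick arithmetic check:

```python
# Sanity check of the figures from the log above (pure arithmetic).
GiB = 2**30
limit = 7 * GiB              # the 7gb memory request: 7516192768 bytes
sampled = 15968747520        # the usage MOM reported when it killed the job
print(sampled / limit)       # ratio is roughly 2.1, about double the request
```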