[torqueusers] Torque memory allocation

Coyle, James J [ITACD] jjc at iastate.edu
Tue Apr 13 09:29:53 MDT 2010


Fan,

You are probably running into the default settings for pmem and vmem, which you are not setting.
The defaults are probably 4GB.

I'll assume that you have nodes with 16 processors and 16GB of memory (1GB per processor on average),
and that the Java app is a single process, so you are only reserving 1 processor with nodes=1:ppn=1.
Your reservation then looks something like:

#PBS -l mem=12gb,nodes=1:ppn=1,walltime=1:00:00
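
As a quick check of what your site's defaults actually are, you can print the server and queue settings with the qmgr client (a sketch; the queue name "batch" is just an assumption):

  # show server- and queue-level settings, including any resources_default.mem/pmem/vmem
  qmgr -c 'print server'
  # an administrator could raise a queue default, e.g.:
  qmgr -c 'set queue batch resources_default.mem = 16gb'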

If so, I'd suggest also setting pmem and vmem, and reserving enough processors on that node that the reserved processors, at the average memory per processor, cover your memory needs. In this case, 12GB at 1GB per processor means reserving 12 processors:

#PBS -l vmem=12gb,pmem=12gb,mem=12gb,nodes=1:ppn=12,walltime=1:00:00

Then 12/16ths of the memory is being reserved, so you reserve 12/16ths of the CPUs on that node.
That way, two of these jobs cannot fit onto one node, and if the process is being killed for exceeding
virtual memory (vmem) or process size (pmem), those limits should take care of that.
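
Putting that together, a minimal sketch of the whole submission script might look like this (the job name, heap flag, and jar name are illustrative placeholders; the JVM heap is set a bit below 12GB to leave room for JVM overhead):

  #!/bin/bash
  #PBS -N java-12gb
  #PBS -l vmem=12gb,pmem=12gb,mem=12gb,nodes=1:ppn=12,walltime=1:00:00

  # run from the directory the job was submitted from
  cd $PBS_O_WORKDIR

  # cap the Java heap below the 12GB reservation
  java -Xmx11g -jar myapp.jar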

Also, if you are using only a single node and running tcsh or csh, I'd place the command
unlimit stacksize
in the script before the memory-intensive command (look up the equivalent command if you are in a Bourne shell like bash).
If you use multiple nodes, put this command in your ~/.cshrc file.
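
For reference, the two variants side by side (ulimit -s is the Bourne-shell form of the same stack-size limit):

  # tcsh / csh
  unlimit stacksize

  # bash / sh equivalent
  ulimit -s unlimited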


 James Coyle, PhD
 High Performance Computing Group
 115 Durham Center
 Iowa State Univ.
 Ames, Iowa 50011           web: http://www.public.iastate.edu/~jjc

From: torqueusers-bounces at supercluster.org On Behalf Of Fan Dong
Sent: Monday, April 12, 2010 9:21 PM
To: torqueusers at supercluster.org
Subject: [torqueusers] Torque memory allocation

Hi there,

I am running into a problem described as follows:
1) We have some memory-intensive Java jobs to run through Torque; each job requires 12GB of memory, and each node in the cluster has 16GB of memory.
2) When a job is running on one of the nodes, Torque does not prevent a new job (also requiring 12GB of memory) from starting on the same node, causing the new job to fail because there is not enough memory.  (We already let Torque scatter the jobs across the nodes, but this happens when there are more jobs than nodes.)
3) We tried using -l mem=12gb, but it did not work.  Torque seems to have a 4GB limit for this setting.

I was wondering if there is any solution for that.  We are not using Moab or Maui.

Any input is highly appreciated.
