[torqueusers] Short of physical memory, crash?
Coyle, James J [ITACD]
jjc at iastate.edu
Fri Dec 21 08:22:55 MST 2012
The crash will happen only if all physical memory + swap space is exceeded, and even then the out-of-memory
(OOM) process killer (see http://linux-mm.org/OOM_Killer) may save the node by killing exceptionally
huge processes. You can check the amount of swap space on a node via the
Linux command. If there is sufficient swap space + physical memory for your program, the pbs_mom,
and other system processes, then there should be no crash, but things may slow down quite a bit.
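As a rough illustration of the physical memory + swap check, here is a minimal sketch that sums the MemTotal and SwapTotal fields of /proc/meminfo. It is not necessarily the command meant above; the function name mem_plus_swap_kb and its optional file argument are made up for this example.

```shell
#!/bin/sh
# Sketch only: mem_plus_swap_kb is a hypothetical helper, not a standard tool.
# It sums MemTotal and SwapTotal (both reported in kB) from a meminfo-style
# file, defaulting to /proc/meminfo on Linux.
mem_plus_swap_kb() {
    awk '/^MemTotal:/  { mem = $2 }
         /^SwapTotal:/ { swap = $2 }
         END           { printf "%.0f\n", mem + swap }' "${1:-/proc/meminfo}"
}

# Print the total for this node, if /proc/meminfo is available.
if [ -r /proc/meminfo ]; then
    mem_plus_swap_kb
fi
```

If the combined peak usage of all jobs on the node, plus the pbs_mom and other system processes, stays below this total, the node should page rather than crash.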
If your processes really need 4.5GB, then use vmem=4608MB,pmem=4608MB.
This should allow 10 on a node.
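The figure of 10 follows from dividing the node's 48 GB (the RAM size given in the original question) by the 4608 MB per-job request. A small sketch of that arithmetic, with jobs_per_node as a made-up helper name:

```shell
#!/bin/sh
# Sketch only: jobs_per_node is a hypothetical helper for the arithmetic.
# $1 = node RAM in GB, $2 = per-job memory request in MB.
# Integer division rounds down, since a partial job cannot be scheduled.
jobs_per_node() {
    echo $(( $1 * 1024 / $2 ))
}

jobs_per_node 48 4608   # 49152 MB / 4608 MB -> prints 10
jobs_per_node 48 4096   # 49152 MB / 4096 MB -> prints 12
```

This ignores the memory needed by the OS and the pbs_mom, so it is an upper bound on what the node can hold without paging.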
If you submit with a reservation smaller than what will actually be used, expect problems (probably slowness).
If you do so, run at least two of the commands with a nice parameter of 18 or 19.
This will allow the OS and paging system to get more CPU cycles, and hence be able to respond.
#PBS -l nodes=1:ppn=1,vmem=4GB,pmem=4GB,mem=4GB,walltime=48:00:00
nice +19 ./a.out
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Tian, Dong
Sent: Thursday, December 20, 2012 5:36 PM
To: Torque Users Mailing List
Subject: [torqueusers] Short of physical memory, crash?
I have the following question as a cluster user. My job is to submit jobs to the cluster to do simulations. Forgive me if my question sounds simple. :-)
In one example, a compute node has 48 GB RAM and 12 cores/CPUs. If each job takes <4GB RAM, there should be no issue running 12 jobs on one node.
Now the problem is that one job takes 4.5 GB physical RAM at peak, say as reported by qstat -f. If 12 such jobs are submitted and running on one compute node, is there any risk of crashing the compute node? Let us assume the job program is written in a safe manner.
My understanding is that the compute node may crash from the shortage of memory, but I want to have confirmation from you guys.
Appreciate your time!