[torqueusers] Short of physical memory, crash?

Tian, Dong dong.tian at gmail.com
Fri Dec 21 11:59:56 MST 2012


Dear James, Chris, Bacchin and all,

Thanks for explaining in such detail. You have helped me better understand
the memory management issues on a compute node.

Though it is interesting to use a nice value of 18 or 19, I conclude that
mandating a memory parameter for all job submissions would be good
practice; otherwise, jobs without a memory parameter may still land on a
node and exceed its memory + swap space, which would cause problems.

Is it common to mandate a memory parameter for all job submissions? I do
not want to ask other users to do any extra work, even if it is just typing
a few more words.
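One possibility, if the administrators agree, might be to set queue-level
defaults in Torque so that jobs submitted without a memory parameter still
receive one. A minimal sketch, assuming a queue named "batch" and a 4 GB
default (to be run by the Torque administrator):

    # Jobs that omit mem/vmem inherit these defaults; jobs that request
    # memory explicitly are unaffected.
    qmgr -c "set queue batch resources_default.mem = 4gb"
    qmgr -c "set queue batch resources_default.vmem = 4gb"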

Thanks, and Happy Holidays to you,
Dong

On Fri, Dec 21, 2012 at 10:22 AM, Coyle, James J [ITACD] <jjc at iastate.edu> wrote:

>
>   The crash will happen only if all physical memory + swap space is
> exceeded, and the out-of-memory (OOM) killer (see
> http://linux-mm.org/OOM_Killer) may save the node by killing exceptionally
> huge processes.  You can check the amount of swap space on a node via the
>
>    swapon -s
>
> Linux command.  If there is sufficient swap space + physical memory for
> your program, the pbs_mom, and other system processes, then there should
> be no crash, but things may slow down quite a bit.
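>
> Illustrative output only (a sketch; the device name and sizes below are
> assumptions and will vary from node to node):
>
>    Filename          Type        Size     Used   Priority
>    /dev/sda2         partition   8388604  0      -1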
>
>    If your processes really need 4.5 GB, then use vmem=4608MB,pmem=4608MB.
> This should allow 10 on a node.
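>
> For example, the same limits can be given on the qsub command line (the
> job script name "myjob.sh" here is hypothetical):
>
>    qsub -l nodes=1:ppn=1,pmem=4608mb,vmem=4608mb,walltime=48:00:00 myjob.sh
>
> The count of 10 follows from 48 GB = 49152 MB and 49152 / 4608 = 10.67,
> so ten such jobs fit within physical RAM.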
>
>    If you submit with a reservation less than what will actually be used,
> expect problems (most likely slowness). If you do so, run at least two of
> the commands with a nice value of 18 or 19. This will allow the OS and
> paging system to get more CPU cycles, and hence respond marginally better.
>
> E.g.:
>
>    #!/bin/csh
>
>    #PBS -l nodes=1:ppn=1,vmem=4GB,pmem=4GB,mem=4GB,walltime=48:00:00
>
>    # Run from the submission directory, at low priority so the OS
>    # and paging system keep some CPU headroom.
>    cd ${PBS_O_WORKDIR}
>    nice +19 ./a.out
>
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Tian, Dong
> Sent: Thursday, December 20, 2012 5:36 PM
> To: Torque Users Mailing List
> Subject: [torqueusers] Short of physical memory, crash?
>
> Dear Experts,
>
> I have the following question as a cluster user. My work involves
> submitting simulation jobs to the cluster. Forgive me if my question
> sounds simple. :-)
>
> In one example, a compute node has 48 GB of RAM and 12 cores. If each job
> takes <4 GB of RAM, there should be no issue running 12 jobs on one node.
>
> Now the problem is that one job takes 4.5 GB of physical RAM at peak, as
> reported by qstat -f. If 12 such jobs are submitted and running on one
> compute node, is there any risk of crashing the node? Let us assume the
> job program is written in a safe manner.
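>
> (For the arithmetic: 12 x 4.5 GB = 54 GB, which exceeds the 48 GB of
> physical RAM by 6 GB, so the jobs fit only if the node has at least
> roughly 6 GB of free swap.)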
>
> My understanding is that the compute node may crash from the shortage of
> memory, but I would like confirmation from you.
>
> I appreciate your time!
>
> Thanks,
> Dong
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers