[torqueusers] How to allow a job to use all memory on a node with cpuset enabled?

François P-L francois.prudhomme at hotmail.fr
Tue Aug 27 07:22:18 MDT 2013


Hi,
We are encountering some problems with jobs asking for too much memory.
For example, a job asks for 4 cpus and 126Gb:

pbs_mom: LOG_INFO::create_job_cpuset, creating cpuset for job 235376[2]: 4 cpus (0-3), 1 mems (0)
For my test I use "stress" with the following command:

stress -c 2 -t 600 --vm 2 --vm-bytes 61G
My node has this topology:

Machine (128GB)
  NUMANode L#0 (P#0 64GB) + Socket L#0 + L3 L#0 (20MB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
    L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
    L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
    L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
    L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
    L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
    L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
    L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
  NUMANode L#1 (P#1 64GB) + Socket L#1 + L3 L#1 (20MB)
    L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
    L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
    L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
    L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
    L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
    L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
    L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
    L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
After a few seconds:

kernel: [517453.738199] stress invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
(...)
kernel: [517453.738204] stress cpuset=235376[2] mems_allowed=0
(...)
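The two --vm workers together ask for about 122GB, but mems_allowed=0 pins the job to the 64GB of NUMA node 0, which explains the OOM. For what it's worth, the binding can also be read straight from the job's cpuset (untested sketch; the /dev/cpuset/torque/<jobid> path and the cpus/mems file names are assumptions about how pbs_mom mounts its cpuset filesystem, some setups expose them as cpuset.cpus/cpuset.mems instead):

cat /dev/cpuset/torque/'235376[2]'/cpus    # expected: 0-3
cat /dev/cpuset/torque/'235376[2]'/mems    # expected: 0  (NUMA node 0 only, 64GB)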
After reading the qsub options, the "-n" option can "solve" the problem (example below)... but it is a big waste of cpu in this case, since the whole node gets dedicated to this job.
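For reference, the kind of submission I mean is something like this (the script name and the resource list are just placeholders matching the 4 cpu / 126Gb job above):

qsub -n -l nodes=1:ppn=4,mem=126gb job.sh

With "-n" the node is allocated exclusively to the job, which avoids the OOM, but the 12 remaining cores sit idle for the whole run.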
Is there a way to allow a job to use all the memory of a node without using all of its cpus?
Many thanks in advance.