[torquedev] Torque 2.5.8 - memory leaks

Ken Nielson knielson at adaptivecomputing.com
Wed Sep 28 16:00:32 MDT 2011


----- Original Message -----
> From: "Lukasz Flis" <l.flis at cyf-kr.edu.pl>
> To: "Torque Developers mailing list" <torquedev at supercluster.org>
> Sent: Tuesday, September 27, 2011 5:15:34 PM
> Subject: [torquedev] Torque 2.5.8 - memory leaks
> 
> Hi,
> 
> We are running a medium-sized cluster with Torque and Moab.
> Average number of jobs: usually around 4k
> Number of nodes: 950
> Number of cores (at the moment): 10k
> 
> We recently migrated from 2.4.12 to 2.5.8. Unfortunately, we are
> observing memory issues with Torque. I'm attaching a daily graph from our
> Ganglia monitoring system. On average it takes 3 hours for Torque to
> consume 8 GB of memory. Afterwards the pbs_server daemon needs to be
> restarted, as it is unable to fork due to lack of available memory;
> then the OOM killer kicks in.
> 
>   pbs_server: LOG_ERROR::Cannot allocate memory (12) in send_job, fork failed
> 
> Due to the scale of the system I am unable to run pbs_server under
> valgrind to find the source of the leak. I did some testing on our test
> cluster, but I'm not sure how accurate the results from valgrind are.
> 
> A lot of messages point to the decode_str function:
> ==31895== 41 bytes in 4 blocks are definitely lost in loss record 40 of 81
> ==31895==    at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
> ==31895==    by 0x452BD8: decode_str (attr_fn_str.c:144)
> ==31895==    by 0x40A768: recov_attr (attr_recov.c:512)
> ==31895==    by 0x44909C: svr_recov (svr_recov.c:204)
> ==31895==    by 0x41F4A5: get_svr_attr (pbsd_init.c:2387)
> ==31895==    by 0x4213A2: main (pbsd_main.c:1465)
> 
> as well as to decode_arst_direct:
> 
> ==31895== 143 (56 direct, 87 indirect) bytes in 1 blocks are definitely lost in loss record 55 of 84
> ==31895==    at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
> ==31895==    by 0x44F1B8: decode_arst_direct (attr_fn_arst.c:189)
> ==31895==    by 0x44F3DE: decode_arst (attr_fn_arst.c:311)
> ==31895==    by 0x40A768: recov_attr (attr_recov.c:512)
> ==31895==    by 0x44909C: svr_recov (svr_recov.c:204)
> ==31895==    by 0x41F4A5: get_svr_attr (pbsd_init.c:2387)
> ==31895==    by 0x4213A2: main (pbsd_main.c:1465)
> 
> 
> I am going to dig into the sources a bit and see whether the memory
> allocated by the above functions is freed properly.
> However, any suggestions and hints are welcome, as I might be unable
> to fix it all myself.
> 
> Thank you for your attention
> --
> Lukasz Flis
> 
> 
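For readers following the traces quoted above: both loss records start at a malloc made inside an attribute decode function (decode_str, decode_arst_direct). One common way such a record arises is a decode routine that overwrites a pointer to an earlier allocation without freeing it. The following is only a minimal sketch of that general pattern; the names str_attr, decode_str_leaky and decode_str_fixed are hypothetical, this is not Torque's source, and it is not a diagnosis of the leak reported here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical attribute holding a heap-allocated string value. */
struct str_attr {
    char *value;
};

/* Leaky variant: the previous value is overwritten without being freed,
 * so every repeated decode of the same attribute loses one block. */
int decode_str_leaky(struct str_attr *attr, const char *val)
{
    assert(attr != NULL && val != NULL);
    char *copy = malloc(strlen(val) + 1);
    if (copy == NULL)
        return -1;
    strcpy(copy, val);
    attr->value = copy;      /* old allocation, if any, is now unreachable */
    return 0;
}

/* Fixed variant: release the old value before installing the new one. */
int decode_str_fixed(struct str_attr *attr, const char *val)
{
    assert(attr != NULL && val != NULL);
    char *copy = malloc(strlen(val) + 1);
    if (copy == NULL)
        return -1;
    strcpy(copy, val);
    free(attr->value);       /* free(NULL) is a no-op */
    attr->value = copy;
    return 0;
}
```

Calling the leaky variant twice on the same attribute is the kind of sequence valgrind reports as "definitely lost"; the fixed variant assumes the attribute starts out with value set to NULL.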
What scheduler are you using?

Ken Nielson

