[torquedev] Torque 2.5.8 - memory leaks

Lukasz Flis l.flis at cyf-kr.edu.pl
Tue Sep 27 17:15:34 MDT 2011


Hi,

We are running a medium-sized cluster with Torque and Moab.
The average number of jobs is usually around 4k.
Number of nodes: 950
Number of cores (at the moment): 10k

We have recently migrated from 2.4.12 to 2.5.8. Unfortunately, we are 
observing Torque memory issues. I'm attaching a daily graph from our 
Ganglia monitoring system. On average it takes 3 hours for Torque to 
consume 8 GB of memory. After that, the pbs_server daemon has to be 
restarted, because it can no longer fork due to lack of available 
memory, and then the OOM killer kicks in.

  pbs_server: LOG_ERROR::Cannot allocate memory (12) in send_job, fork failed

Due to the scale of the system, I am unable to run pbs_server under 
valgrind in production to find the source of the leak. I did some 
testing on our test cluster, but I'm not sure how accurate the results 
from valgrind are.

A lot of the messages point to the decode_str function:
==31895== 41 bytes in 4 blocks are definitely lost in loss record 40 of 81
==31895==    at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
==31895==    by 0x452BD8: decode_str (attr_fn_str.c:144)
==31895==    by 0x40A768: recov_attr (attr_recov.c:512)
==31895==    by 0x44909C: svr_recov (svr_recov.c:204)
==31895==    by 0x41F4A5: get_svr_attr (pbsd_init.c:2387)
==31895==    by 0x4213A2: main (pbsd_main.c:1465)

Others point to decode_arst_direct:

==31895== 143 (56 direct, 87 indirect) bytes in 1 blocks are definitely lost in loss record 55 of 84
==31895==    at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
==31895==    by 0x44F1B8: decode_arst_direct (attr_fn_arst.c:189)
==31895==    by 0x44F3DE: decode_arst (attr_fn_arst.c:311)
==31895==    by 0x40A768: recov_attr (attr_recov.c:512)
==31895==    by 0x44909C: svr_recov (svr_recov.c:204)
==31895==    by 0x41F4A5: get_svr_attr (pbsd_init.c:2387)
==31895==    by 0x4213A2: main (pbsd_main.c:1465)


I am going to dig into the sources a bit and see whether the memory 
allocated by the above functions is freed properly.
However, any suggestions and hints are welcome, as I might be unable 
to fix it all myself.
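
To illustrate the kind of bug I will be looking for, here is a minimal 
sketch in C of a string setter that leaks and one that does not. All 
names in it (struct str_attr, set_str_leaky, set_str_fixed, 
free_str_attr) are made up for this example; it is not the actual code 
from attr_fn_str.c or attr_fn_arst.c, and whether the real decode 
functions behave this way is only an assumption until the sources are 
checked.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a string-valued attribute.
 * This is NOT Torque's real attribute structure. */
struct str_attr {
    char *value;   /* heap buffer owned by the attribute, NULL when unset */
};

/* Leaky pattern: the old buffer is overwritten without being freed,
 * so every call after the first one loses the previous allocation. */
int set_str_leaky(struct str_attr *attr, const char *val)
{
    assert(attr != NULL && val != NULL);

    char *copy = malloc(strlen(val) + 1);
    if (copy == NULL)
        return -1;
    strcpy(copy, val);

    attr->value = copy;   /* previous buffer, if any, is now unreachable */
    return 0;
}

/* Fixed pattern: release the old buffer before installing the new one. */
int set_str_fixed(struct str_attr *attr, const char *val)
{
    assert(attr != NULL && val != NULL);

    char *copy = malloc(strlen(val) + 1);
    if (copy == NULL)
        return -1;        /* old value is left intact on failure */
    strcpy(copy, val);

    free(attr->value);    /* free(NULL) is a no-op */
    attr->value = copy;
    return 0;
}

/* Release the buffer when the attribute is no longer needed. */
void free_str_attr(struct str_attr *attr)
{
    assert(attr != NULL);

    free(attr->value);
    attr->value = NULL;
}
```

Calling set_str_leaky twice on the same attribute and exiting would 
show up in valgrind as a "definitely lost" block with malloc at the top 
of the stack, similar in shape to the records above. With set_str_fixed 
followed by free_str_attr no block should be reported.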

Thank you for your attention
--
Lukasz Flis


-------------- next part --------------
A non-text attachment was scrubbed...
Name: leaking-torque.png
Type: image/png
Size: 58057 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torquedev/attachments/20110928/1a3bb9a0/attachment-0001.png 

