[torquedev] Torque 2.5.8 - memory leaks
Lukasz Flis
l.flis at cyf-kr.edu.pl
Tue Sep 27 17:15:34 MDT 2011
Hi,
We are running a medium-sized cluster with Torque and Moab:
Average number of jobs: usually around 4k
Number of nodes: 950
Number of cores (at the moment): 10k
We have recently migrated from 2.4.12 to 2.5.8. Unfortunately, we are
observing Torque memory issues. I'm attaching a daily graph from our
Ganglia monitoring system. On average it takes 3 hours for Torque to
consume 8 GB of memory. After that the pbs_server daemon has to be
restarted, because it can no longer fork for lack of available memory,
and then the OOM killer kicks in.
pbs_server: LOG_ERROR::Cannot allocate memory (12) in send_job, fork failed
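
For reference, 8 GB in 3 hours is roughly 2.7 GB per hour, or about 45 MB
per minute. The following is a minimal sketch, not part of Torque, for
putting a number on that growth from outside the daemon. It assumes a
Linux host where /proc/<pid>/status contains a VmRSS line; the function
names (rss_kib, growth_mib_per_hour, read_status) are made up for this
illustration.

```python
import re


def rss_kib(status_text):
    """Extract VmRSS (in KiB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no VmRSS line found")
    return int(match.group(1))


def growth_mib_per_hour(samples):
    """Average growth between the first and last (unix_seconds, rss_kib) sample."""
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (r1 - r0) / 1024 / ((t1 - t0) / 3600)


def read_status(pid):
    """Read /proc/<pid>/status (Linux only; pid is e.g. that of pbs_server)."""
    with open(f"/proc/{pid}/status") as handle:
        return handle.read()


if __name__ == "__main__":
    # Hypothetical samples matching the figures quoted above:
    # resident memory grows from 0 to 8 GiB over 3 hours.
    samples = [(0, 0), (3 * 3600, 8 * 1024 * 1024)]
    print(f"{growth_mib_per_hour(samples):.0f} MiB/hour")
```

In practice one would call rss_kib(read_status(pid)) periodically, store
the value together with time.time(), and feed the collected pairs to
growth_mib_per_hour.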
Due to the scale of the system I am unable to run pbs_server under
valgrind to find the source of the leak. I did some testing on our test
cluster, but I'm not sure how accurate the results from valgrind are.
A lot of messages point to the decode_str function:
==31895== 41 bytes in 4 blocks are definitely lost in loss record 40 of 81
==31895== at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
==31895== by 0x452BD8: decode_str (attr_fn_str.c:144)
==31895== by 0x40A768: recov_attr (attr_recov.c:512)
==31895== by 0x44909C: svr_recov (svr_recov.c:204)
==31895== by 0x41F4A5: get_svr_attr (pbsd_init.c:2387)
==31895== by 0x4213A2: main (pbsd_main.c:1465)
and also to decode_arst_direct:
==31895== 143 (56 direct, 87 indirect) bytes in 1 blocks are definitely lost in loss record 55 of 84
==31895== at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
==31895== by 0x44F1B8: decode_arst_direct (attr_fn_arst.c:189)
==31895== by 0x44F3DE: decode_arst (attr_fn_arst.c:311)
==31895== by 0x40A768: recov_attr (attr_recov.c:512)
==31895== by 0x44909C: svr_recov (svr_recov.c:204)
==31895== by 0x41F4A5: get_svr_attr (pbsd_init.c:2387)
==31895== by 0x4213A2: main (pbsd_main.c:1465)
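
Both traces above run through main -> get_svr_attr -> svr_recov, so they
may be small allocations made once at startup rather than the source of
a multi-gigabyte leak. To see which functions dominate a long valgrind
log, the "definitely lost" records can be summed per allocating
function. The script below is a minimal sketch that assumes the log
format shown in the traces above; the names summarize, HEADER, FRAME and
ALLOCATORS are made up for this illustration, and the allocator list is
an assumption that may need extending.

```python
import re
from collections import defaultdict

# Header of a leak record, e.g.
# "==31895== 41 bytes in 4 blocks are definitely lost in loss record 40 of 81"
HEADER = re.compile(
    r"^==\d+==\s+([\d,]+) (?:\([^)]*\) )?bytes in ([\d,]+) blocks are definitely"
)
# Stack frame, e.g. "==31895== by 0x452BD8: decode_str (attr_fn_str.c:144)"
FRAME = re.compile(r"^==\d+==\s+(?:at|by) 0x[0-9A-Fa-f]+: (\S+)")

# Frames to skip when looking for the function that requested the memory.
ALLOCATORS = {"malloc", "calloc", "realloc", "strdup"}


def summarize(log_text):
    """Return {function: (bytes, blocks)} for 'definitely lost' records."""
    totals = defaultdict(lambda: [0, 0])
    pending = None  # (bytes, blocks) of a record still waiting for its caller
    for line in log_text.splitlines():
        header = HEADER.match(line)
        if header:
            pending = tuple(int(g.replace(",", "")) for g in header.groups())
            continue
        frame = FRAME.match(line)
        if frame and pending and frame.group(1) not in ALLOCATORS:
            totals[frame.group(1)][0] += pending[0]
            totals[frame.group(1)][1] += pending[1]
            pending = None  # ignore the rest of this record's stack
    return {name: tuple(value) for name, value in totals.items()}


SAMPLE = """\
==31895== 41 bytes in 4 blocks are definitely lost in loss record 40 of 81
==31895== at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
==31895== by 0x452BD8: decode_str (attr_fn_str.c:144)
==31895== by 0x40A768: recov_attr (attr_recov.c:512)
==31895== 143 (56 direct, 87 indirect) bytes in 1 blocks are definitely
lost in loss record 55 of 84
==31895== at 0x4A05E1C: malloc (vg_replace_malloc.c:195)
==31895== by 0x44F1B8: decode_arst_direct (attr_fn_arst.c:189)
"""

if __name__ == "__main__":
    ranked = sorted(summarize(SAMPLE).items(), key=lambda item: -item[1][0])
    for name, (nbytes, blocks) in ranked:
        print(f"{name}: {nbytes} bytes in {blocks} blocks")
```

Grouping by the first frame below the allocator is a deliberate
simplification; grouping by a deeper frame such as recov_attr would
merge both records into one entry.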
I am going to dig into the sources a bit and see whether the memory
allocated by the above functions is freed properly.
However, any suggestions and hints would be welcome, as I might be
unable to fix it all myself.
Thank you for your attention.
--
Lukasz Flis
-------------- next part --------------
A non-text attachment was scrubbed...
Name: leaking-torque.png
Type: image/png
Size: 58057 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torquedev/attachments/20110928/1a3bb9a0/attachment-0001.png