Bugzilla – Bug 205
pbs_server memory leak on GPU clusters.
Last modified: 2012-07-18 12:59:23 MDT
Created an attachment (id=111) [details] missing free of decode_arst return value

This bug report relates to: http://www.supercluster.org/pipermail/torquedev/2012-July/004148.html

Lukasz stated that the leak became visible after adding a number of new GPU-powered machines. We have also just hit the same problem at our center on a new 60-node GPU cluster. A run under valgrind in a test environment revealed the following problem:

==16295== 95,004 bytes in 3,906 blocks are definitely lost in loss record 27 of 28
==16295==    at 0x4A0739E: malloc (vg_replace_malloc.c:207)
==16295==    by 0x4C18B2C: disrst (disrst.c:133)
==16295==    by 0x4158E4: is_gpustat_get (node_manager.c:1504)
==16295==    by 0x416A58: is_request (node_manager.c:2612)
==16295==    by 0x41E354: do_rpp (pbsd_main.c:416)
==16295==    by 0x41E3DA: rpp_request (pbsd_main.c:462)
==16295==    by 0x4C3372D: wait_request (net_server.c:508)
==16295==    by 0x41F6B8: main_loop (pbsd_main.c:1203)
==16295==    by 0x42033C: main (pbsd_main.c:1759)

A suggested patch is attached.

Cheers,
Mariusz
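For readers without the attachments at hand: the trace above attributes the lost blocks to memory allocated in disrst and reached through is_gpustat_get. The general pattern is a heap-allocated string that the caller owns but does not release on every path. The sketch below illustrates that pattern and one way to avoid it. It is not the TORQUE code or either attached patch; read_string_stub, decode_stub, and process_status_lines are hypothetical names invented for this illustration, and the "name=value" check is an assumption standing in for the real decode step.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a reader such as disrst(): returns a
 * heap-allocated copy of the input. The caller owns it and must free it. */
static char *read_string_stub(const char *wire)
{
    size_t len = strlen(wire);
    char *copy = malloc(len + 1);
    if (copy != NULL)
        memcpy(copy, wire, len + 1);
    return copy;
}

/* Hypothetical stand-in for a decode step: returns 0 if the line looks
 * like "name=value", -1 otherwise. It does not take ownership of line. */
static int decode_stub(const char *line)
{
    const char *eq = strchr(line, '=');
    return (eq != NULL && eq != line) ? 0 : -1;
}

/* Processes a NULL-terminated array of status lines. Each string obtained
 * from read_string_stub() is freed before the loop continues or returns,
 * so neither the success path nor the error path leaks it.
 * Returns the number of lines decoded, or -1 on failure. */
static int process_status_lines(const char *const *wire_lines)
{
    int decoded = 0;

    assert(wire_lines != NULL);

    for (size_t i = 0; wire_lines[i] != NULL; i++) {
        char *line = read_string_stub(wire_lines[i]);
        if (line == NULL)
            return -1;

        int rc = decode_stub(line);
        free(line); /* release on every path, before any early return */

        if (rc != 0)
            return -1;
        decoded++;
    }
    return decoded;
}
```

Freeing immediately after the last use, rather than at the end of the function, keeps early returns from skipping the release. Rerunning under valgrind with --leak-check=full is the usual way to confirm that a fix of this kind removes the "definitely lost" record.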
Created an attachment (id=112) [details] Fixes a memory leak when nodes with GPUs send updates to pbs_server

The previous patch missed a couple of cases for memory leaks. This one takes care of the rest of the leaks.
The patch submitted by Mariusz missed a couple of cases where memory was leaking. I have attached another patch which fixes all the leaks. Thanks for locating this problem and pointing us in the right direction.