Bug 205 - pbs_server memory on GPU clusters.
Status: RESOLVED FIXED
Product: TORQUE
Component: pbs_server
Version: 2.5.x
Platform: All All
Importance: P5 major
Assigned To: David Beer
Reported: 2012-07-14 09:41 MDT by Mariusz Mamonski
Modified: 2012-07-18 12:59 MDT
Attachments
missing free of decode_arst return value (824 bytes, application/octet-stream)
2012-07-14 09:41 MDT, Mariusz Mamonski
Fixes a memory leak when nodes with GPUs send updates to pbs_server (733 bytes, patch)
2012-07-18 12:57 MDT, Ken Nielson


Description Mariusz Mamonski 2012-07-14 09:41:22 MDT
Created an attachment (id=111)
missing free of decode_arst return value

This bug report relates to:

http://www.supercluster.org/pipermail/torquedev/2012-July/004148.html

Lukasz stated that the leak became visible after adding a number of new
GPU-powered machines. We have just hit the same problem at our center on a new
60-node GPU cluster.

A run under valgrind in a test environment revealed the following problem:

==16295== 95,004 bytes in 3,906 blocks are definitely lost in loss record 27 of 28
==16295==    at 0x4A0739E: malloc (vg_replace_malloc.c:207)
==16295==    by 0x4C18B2C: disrst (disrst.c:133)
==16295==    by 0x4158E4: is_gpustat_get (node_manager.c:1504)
==16295==    by 0x416A58: is_request (node_manager.c:2612)
==16295==    by 0x41E354: do_rpp (pbsd_main.c:416)
==16295==    by 0x41E3DA: rpp_request (pbsd_main.c:462)
==16295==    by 0x4C3372D: wait_request (net_server.c:508)
==16295==    by 0x41F6B8: main_loop (pbsd_main.c:1203)
==16295==    by 0x42033C: main (pbsd_main.c:1759)
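
The trace shows blocks allocated inside disrst, reached from is_gpustat_get, that are never released. The following is only a minimal sketch of that general pattern and of one way to avoid it; it is not the TORQUE code or the attached patch. The names read_status_string, release_status_string, process_gpu_status and live_blocks are hypothetical, and the "name=value" check is an assumed stand-in for whatever validation the real loop performs.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Count of outstanding allocations; illustration only, not part of TORQUE. */
static int live_blocks = 0;

/* Hypothetical stand-in for a decoder that returns a malloc'd string
 * which the caller owns and must free. */
static char *read_status_string(const char *src)
{
    size_t len = strlen(src) + 1;
    char *copy = malloc(len);

    if (copy == NULL)
        return NULL;
    memcpy(copy, src, len);
    live_blocks++;
    return copy;
}

/* Frees a string obtained from read_status_string(). */
static void release_status_string(char *s)
{
    if (s != NULL) {
        assert(live_blocks > 0);
        free(s);
        live_blocks--;
    }
}

/* Processes a batch of status strings. Returns the number accepted,
 * or -1 if a string is malformed (assumed here to mean "has no '='").
 * Every path that leaves the loop body releases the string first. */
static int process_gpu_status(const char *const *updates, size_t count)
{
    int accepted = 0;

    for (size_t i = 0; i < count; i++) {
        char *str = read_status_string(updates[i]);

        if (str == NULL)
            return -1;

        if (strchr(str, '=') == NULL) {
            /* Early exit: the free that is easiest to forget. */
            release_status_string(str);
            return -1;
        }

        accepted++;
        /* Normal path: release before reading the next string. */
        release_status_string(str);
    }
    return accepted;
}
```

Calling process_gpu_status on any batch, well-formed or not, should leave live_blocks at zero; dropping either release_status_string call leaks one block per update, which grows with the number of reporting nodes.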

A suggested patch is attached.

Cheers,
Mariusz
Comment 1 Ken Nielson 2012-07-18 12:57:56 MDT
Created an attachment (id=112)
Fixes a memory leak when nodes with GPUs send updates to pbs_server

The previous patch missed a couple of cases for memory leaks. This one takes
care of the rest of the leaks.
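
When a function has several exit paths, a common C idiom for not missing a case is to route every exit through one cleanup label. This is a generic sketch of that idiom, not the content of either attached patch; parse_pair and its parameters are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: splits "name=value" and copies the name into
 * name_out. Returns 0 on success, -1 on any error. The temporary copy
 * is freed at a single cleanup label, so no exit path can skip it. */
static int parse_pair(const char *text, char *name_out, size_t name_cap)
{
    int rc = -1;
    char *copy = NULL;
    char *eq;
    size_t len = strlen(text) + 1;

    copy = malloc(len);
    if (copy == NULL)
        goto cleanup;
    memcpy(copy, text, len);

    eq = strchr(copy, '=');
    if (eq == NULL)
        goto cleanup;               /* malformed input */
    *eq = '\0';

    if (strlen(copy) >= name_cap)
        goto cleanup;               /* name does not fit in name_out */

    strcpy(name_out, copy);
    rc = 0;

cleanup:
    free(copy);                     /* free(NULL) is a no-op */
    return rc;
}
```

The trade-off is a goto in exchange for having exactly one place where the allocation is released, which makes a later-added error path harder to get wrong.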
Comment 2 Ken Nielson 2012-07-18 12:59:23 MDT
The patch submitted by Mariusz missed a couple of cases where memory was
leaking. I have attached another patch which fixes all the leaks.

Thanks for locating this problem and pointing us in the right direction.