[torquedev] [Bug 205] New: pbs_server memory on GPU clusters.
bugzilla-daemon at supercluster.org
Sat Jul 14 09:41:22 MDT 2012
http://www.clusterresources.com/bugzilla/show_bug.cgi?id=205
Summary: pbs_server memory on GPU clusters.
Product: TORQUE
Version: 2.5.x
Platform: All
OS/Version: All
Status: NEW
Severity: major
Priority: P5
Component: pbs_server
AssignedTo: dbeer at adaptivecomputing.com
ReportedBy: mamonski at man.poznan.pl
CC: torquedev at supercluster.org
Estimated Hours: 0.0
Created an attachment (id=111)
--> (http://www.clusterresources.com/bugzilla/attachment.cgi?id=111)
missing free of decode_arst return value
This bug-report relates to:
http://www.supercluster.org/pipermail/torquedev/2012-July/004148.html
Lukasz stated that the leak became visible after adding a batch of new GPU-powered
machines. We have just run into the same problem at our center on a new
60-node GPU cluster.
Running pbs_server under valgrind in a test environment revealed the following leak:
==16295== 95,004 bytes in 3,906 blocks are definitely lost in loss record 27 of 28
==16295== at 0x4A0739E: malloc (vg_replace_malloc.c:207)
==16295== by 0x4C18B2C: disrst (disrst.c:133)
==16295== by 0x4158E4: is_gpustat_get (node_manager.c:1504)
==16295== by 0x416A58: is_request (node_manager.c:2612)
==16295== by 0x41E354: do_rpp (pbsd_main.c:416)
==16295== by 0x41E3DA: rpp_request (pbsd_main.c:462)
==16295== by 0x4C3372D: wait_request (net_server.c:508)
==16295== by 0x41F6B8: main_loop (pbsd_main.c:1203)
==16295== by 0x42033C: main (pbsd_main.c:1759)
A suggested patch is attached.
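For illustration, the pattern the valgrind trace points at can be sketched as follows. This is a minimal stand-alone sketch, not the TORQUE source and not the attached patch: read_status_string() is a hypothetical stand-in for a decoder such as disrst() that returns malloc'd memory owned by the caller, and process_status() and the "<end>" sentinel are likewise invented for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a decoder like disrst(): returns a malloc'd,
 * NUL-terminated copy of the input. The caller owns the result. */
static char *read_status_string(const char *wire)
{
    size_t len = strlen(wire);
    char *copy = malloc(len + 1);

    if (copy != NULL)
        memcpy(copy, wire, len + 1);
    return copy;
}

/* Hypothetical consumer loop: reads status strings until the (assumed)
 * "<end>" sentinel and returns how many were handled, or -1 on
 * allocation failure. Every string is freed on every path. */
static int process_status(const char *const *msgs, size_t n)
{
    int handled = 0;

    assert(msgs != NULL);
    for (size_t i = 0; i < n; i++) {
        char *s = read_status_string(msgs[i]);

        if (s == NULL)
            return -1;
        if (strcmp(s, "<end>") == 0) {
            free(s);            /* free on the early-exit path too */
            break;
        }
        handled++;              /* ... use s here ... */
        free(s);                /* omitting this leaks one block per message */
    }
    return handled;
}
```

If the free() calls are omitted, each status message leaves one small block behind, which would match the many small "definitely lost" blocks in the report above.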
Cheers,
Mariusz
--
Configure bugmail: http://www.clusterresources.com/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.