[torquedev] [Bug 205] New: pbs_server memory on GPU clusters.

bugzilla-daemon at supercluster.org
Sat Jul 14 09:41:22 MDT 2012


http://www.clusterresources.com/bugzilla/show_bug.cgi?id=205

           Summary: pbs_server memory on GPU clusters.
           Product: TORQUE
           Version: 2.5.x
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: major
          Priority: P5
         Component: pbs_server
        AssignedTo: dbeer at adaptivecomputing.com
        ReportedBy: mamonski at man.poznan.pl
                CC: torquedev at supercluster.org
   Estimated Hours: 0.0


Created an attachment (id=111)
 --> (http://www.clusterresources.com/bugzilla/attachment.cgi?id=111)
missing free of decode_arst return value

This bug-report relates to:

http://www.supercluster.org/pipermail/torquedev/2012-July/004148.html

Lukasz reported that the leak became visible after adding a batch of new
GPU-equipped machines. We have just hit the same problem at our center on a
new 60-node GPU cluster.

A run under valgrind in a test environment revealed the following leak:

==16295== 95,004 bytes in 3,906 blocks are definitely lost in loss record 27 of 28
==16295==    at 0x4A0739E: malloc (vg_replace_malloc.c:207)
==16295==    by 0x4C18B2C: disrst (disrst.c:133)
==16295==    by 0x4158E4: is_gpustat_get (node_manager.c:1504)
==16295==    by 0x416A58: is_request (node_manager.c:2612)
==16295==    by 0x41E354: do_rpp (pbsd_main.c:416)
==16295==    by 0x41E3DA: rpp_request (pbsd_main.c:462)
==16295==    by 0x4C3372D: wait_request (net_server.c:508)
==16295==    by 0x41F6B8: main_loop (pbsd_main.c:1203)
==16295==    by 0x42033C: main (pbsd_main.c:1759)

A suggested patch is attached.
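
For illustration only, the kind of leak the trace points to (a string that is
malloc'd by a reader and never freed by its caller) can be sketched as below.
This is not the attached patch and not TORQUE code: read_status_string and
consume_status are hypothetical stand-ins, and the real ownership rules in
node_manager.c may differ.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a reader such as disrst(): returns a
 * malloc'd copy of the next string, or NULL when none are left.
 * The caller owns the returned buffer. */
static char *read_status_string(const char **items, size_t count, size_t *pos)
{
    assert(pos != NULL);
    if (*pos >= count)
        return NULL;

    const char *src = items[(*pos)++];
    char *copy = malloc(strlen(src) + 1);
    if (copy != NULL)
        strcpy(copy, src);
    return copy;
}

/* Hypothetical consumer loop: handles every string and frees it.
 * Returns the number of strings handled. */
static size_t consume_status(const char **items, size_t count)
{
    size_t pos = 0;
    size_t handled = 0;
    char *s;

    while ((s = read_status_string(items, count, &pos)) != NULL) {
        /* ... use s here, e.g. copy it into a longer-lived structure ... */
        handled++;
        free(s); /* without this, one block is lost per string read */
    }
    return handled;
}
```

If the free() call is dropped, each iteration loses one small block, which
would match the "many small blocks, definitely lost" shape of the report above.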

Cheers,
Mariusz

