[torquedev] memory leaks in torque-server 2.5.11: question

Ken Nielson knielson at adaptivecomputing.com
Mon Jul 9 11:24:21 MDT 2012


On Tue, Jul 3, 2012 at 10:21 AM, Lukasz Flis <l.flis at cyf-kr.edu.pl> wrote:

> Hi,
>
> We run a medium-sized computing site in Poland.
> We process around 25k jobs daily: grid workloads and multi-node jobs
> submitted locally.
>
> We are facing a problem with the long-running pbs_server process, which
> after a week or two consumes all the memory available on the machine.
> As a result, pbs_server is unable to spawn a subprocess to unmunge
> credentials:
>
> 06/26/2012 15:58:20;0080;PBS_Server;Req;req_reject;Reject reply
> code=15012(PBS_Server System error: Inappropriate ioctl for device
> MSG=couldn't create pipe to unmunge), aux=0,
> type=AlternateUserAuthentication, from qcg-comp at someserver
> 06/26/2012 15:59:20;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Cannot
> allocate memory (12) in pipe_and_read_unmunge, Unable to popen command
> 'unmunge
> --input=/var/spool/torque/server_priv/credentials/munge-15-59-20-640705'
> for reading
>
> I took a core dump of the process when it was nearing 4 GB of RSS and
> VIRT memory.
>
> My question is: how can I determine from the core file which part of
> the server is leaking memory?
>
> Cheers
> --
> Lukasz Flis
>
Lukasz,

What scheduler are you using to run your grid?

We did fix a large memory leak on the MOM in 2.5.12 but that obviously
won't help here.

Valgrind will tell us where the memory is leaking. Are you able to run the
server under Valgrind?

Ken
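
For reference, a Valgrind run of the kind suggested above might be set up roughly as follows. This is only a sketch: the helper name build_valgrind_cmd and the log path are hypothetical, and the bare "pbs_server" invocation is an assumption. pbs_server normally daemonizes, and how to keep it in the foreground for a given TORQUE version is not covered in this thread. The Valgrind options used (--tool=memcheck, --leak-check=full, --num-callers, --log-file with %p expanding to the process ID) are standard Valgrind options.

```python
import shlex


def build_valgrind_cmd(server_cmd, log_path):
    """Return an argv list that runs server_cmd under Valgrind's memcheck tool.

    server_cmd: the server command as a list of arguments (assumed here to
                be just ["pbs_server"]; adjust for the local installation).
    log_path:   where Valgrind writes its report; "%p" in the path is
                replaced by Valgrind with the process ID.
    """
    return [
        "valgrind",
        "--tool=memcheck",      # the leak-checking tool
        "--leak-check=full",    # report each leaked block with a stack trace
        "--num-callers=30",     # deeper stack traces to locate the allocation site
        f"--log-file={log_path}",
    ] + list(server_cmd)


if __name__ == "__main__":
    # Hypothetical log location; any writable path works.
    cmd = build_valgrind_cmd(["pbs_server"], "/tmp/pbs_server.valgrind.%p")
    # Print the command line instead of executing it, so it can be reviewed first.
    print(shlex.join(cmd))
```

Printing the command rather than running it keeps the sketch safe to try on a production host. Expect the server to run considerably slower under Valgrind, so a test instance with a representative job load may be a better target than the live server.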