[torqueusers] 4.1.5.1 memory leak

Ken Nielson knielson at adaptivecomputing.com
Fri Dec 6 11:23:21 MST 2013


It is nice to have confirmation of what we found about
pthread_attr_setstacksize. As you said, the default stack size of a new
thread is roughly 1000 times the value of ulimit -s (which is reported in
kilobytes). Currently the TORQUE code enforces a minimum stack size but it
does not check for a maximum. So if you have your stack size set to 300,000
you will get roughly 300,000,000 bytes allocated for each thread. On the MOM
there are two threads so you get about 600 MB allocated. On the server it
gets huge.
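
A minimal sketch of the kind of minimum-and-maximum check described above. This is not the actual TORQUE patch: the function names and the 1 MiB / 8 MiB bounds are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical bounds for illustration only; these are not TORQUE's values. */
#define EXAMPLE_MIN_STACK ((size_t)1024 * 1024)      /* 1 MiB */
#define EXAMPLE_MAX_STACK ((size_t)8 * 1024 * 1024)  /* 8 MiB */

/* Clamp a requested stack size into [EXAMPLE_MIN_STACK, EXAMPLE_MAX_STACK]. */
size_t clamp_stack_size(size_t requested)
{
    assert(EXAMPLE_MIN_STACK <= EXAMPLE_MAX_STACK);
    if (requested < EXAMPLE_MIN_STACK)
        return EXAMPLE_MIN_STACK;
    if (requested > EXAMPLE_MAX_STACK)
        return EXAMPLE_MAX_STACK;
    return requested;
}

/*
 * Initialize attr and request a clamped stack size.
 * Returns 0 on success or the error number from the pthread call.
 * On success the caller must eventually call pthread_attr_destroy(attr).
 */
int init_thread_attr(pthread_attr_t *attr, size_t requested)
{
    int rc = pthread_attr_init(attr);
    if (rc != 0)
        return rc;

    rc = pthread_attr_setstacksize(attr, clamp_stack_size(requested));
    if (rc != 0)
        pthread_attr_destroy(attr);
    return rc;
}
```

With these example bounds, a request derived from ulimit -s 300000 (about 300,000 * 1024 bytes) would be capped at 8 MiB per thread instead of roughly 300 MB. Build with -pthread.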

The bug in libc is that the value passed to pthread_attr_setstacksize is
ignored. My development box is running libc version 2.15 and the bug is
still there. However, I installed libc 2.18 on a CentOS 6 box and the bug
appears to be fixed. I am going to modify the MOM code to also set a maximum
stack size.
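
One possible way to check whether a given libc honors the requested size is to have a thread report the stack size it actually received. The sketch below assumes glibc (or another libc that provides the GNU extension pthread_getattr_np); the helper names are hypothetical and this is not code from TORQUE.

```c
#define _GNU_SOURCE            /* needed for pthread_getattr_np (GNU extension) */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Thread body: store the stack size this thread actually received in *out. */
static void *report_stack_size(void *out)
{
    pthread_attr_t attr;
    size_t size = 0;

    if (pthread_getattr_np(pthread_self(), &attr) == 0) {
        pthread_attr_getstacksize(&attr, &size);
        pthread_attr_destroy(&attr);
    }
    *(size_t *)out = size;     /* 0 means the query failed */
    return NULL;
}

/*
 * Create a thread that requests `requested` bytes of stack and store the
 * size it actually got in *actual. Returns 0 on success or an error number.
 */
int actual_stack_size(size_t requested, size_t *actual)
{
    pthread_attr_t attr;
    pthread_t tid;
    int rc;

    assert(actual != NULL);

    rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;

    rc = pthread_attr_setstacksize(&attr, requested);
    if (rc == 0)
        rc = pthread_create(&tid, &attr, report_stack_size, actual);
    pthread_attr_destroy(&attr);
    if (rc != 0)
        return rc;

    return pthread_join(tid, NULL);
}
```

If the reported size is close to the requested one, the request was honored. A value near the ulimit -s setting instead would suggest the behavior described above. The exact number reported may differ slightly from the request (for example by the guard page size) depending on the libc version. Build with -pthread.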

The TORQUE fix will show up in 4.2.7.

Regards


On Fri, Dec 6, 2013 at 10:52 AM, Steven Lo <slo at cacr.caltech.edu> wrote:

>
> Hi David,
>
> The nodes which we observed are running the following version:
>
> -bash-3.2# ldd /opt/torque/sbin/pbs_mom | grep libc.so
>     libc.so.6 => /lib64/libc.so.6 (0x00002b18eae2a000)
>
> -bash-3.2# ldd --version
> ldd (GNU libc) 2.5
>
>
> -bash-3.2# qstat --version
> Version: 4.1.5.1
> Revision:
>
> -bash-3.2# uname -a
> Linux zwicky005 2.6.18-308.1.1.el5 #1 SMP Fri Feb 17 16:51:01 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
>
>
>
> We see that it's using ~3G of memory:
>
> -bash-3.2# top -p 16695
>
> top - 09:46:45 up 81 days,  1:01,  1 user,  load average: 9.19, 9.17, 9.11
> Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
> Cpu(s): 74.6%us,  0.7%sy,  0.0%ni, 24.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:  24675856k total, 24286304k used,   389552k free,   497860k buffers
> Swap: 49150856k total,  4750564k used, 44400292k free, 10798448k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 16695 root      15   0 3195m 3.1g 7052 S  0.3 13.1  77:50.71 pbs_mom
>
>
> We came across this posting and are not sure whether it is relevant:
>
> http://comments.gmane.org/gmane.comp.clustering.torque.user/13557
>
>
> Thanks for looking into this.
>
> Steven.
>
>
> On 12/06/2013 09:04 AM, David Beer wrote:
>
> The issue is that in some versions of libc, the pthread stack size will
> default to 1000 * <the value set in ulimit -s>, even though TORQUE
> specifies what stack size each thread should have. I will work to get a
> list of the versions of libc that have this bug. Ken is the one that
> discovered this defect, so I'll ask him for the info or ask him to post the
> info.
>
>
> On Fri, Dec 6, 2013 at 9:02 AM, Gus Correa <gus at ldeo.columbia.edu> wrote:
>
>> David
>>
>> For the benefit of all Torque users,
>> could you please disclose all combinations of libc versions
>> and Torque versions that have this problem?
>>
>> Thank you,
>> Gus Correa
>>
>> On 12/05/2013 08:52 PM, David Beer wrote:
>> > Steven,
>> >
>> > What OS and version of the pthread library (libc) do you have? We know
>> > of a rather large memory leak related to different versions of these
>> > libraries.
>> >
>> >
>> > On Thu, Dec 5, 2013 at 12:01 PM, Steven Lo <slo at cacr.caltech.edu
>>  > <mailto:slo at cacr.caltech.edu>> wrote:
>> >
>> >
>> >     Hi,
>> >
>> >     We've discovered that pbs_mom on most nodes is using over 3GB of
>> >     memory.
>> >     Is there a known memory leak issue for version 4.1.5.1?  If so, is
>> >     there a patch for it, or do we have to upgrade to another version
>> >     like 4.1.7 or 4.2.6.1?
>> >
>> >     Thanks in advance for your suggestion.
>> >
>> >     Steven.
>> >
>> >     _______________________________________________
>> >     torqueusers mailing list
>>  >     torqueusers at supercluster.org <mailto:torqueusers at supercluster.org>
>>  >     http://www.supercluster.org/mailman/listinfo/torqueusers
>> >
>> >
>> >
>> >
>> > --
>> > David Beer | Senior Software Engineer
>> > Adaptive Computing
>> >
>> >
>>
>>
>
>
>
>  --
> David Beer | Senior Software Engineer
> Adaptive Computing
>
>
>
>


-- 
Ken Nielson
+1 801.717.3700 office +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
www.adaptivecomputing.com