[torqueusers] 4.1.5.1 memory leak

Ken Nielson knielson at adaptivecomputing.com
Fri Dec 6 11:55:13 MST 2013


My apologies.  I have been doing more testing, and it appears libc 2.18
still has the bug.  I will be reporting this to the GNU project, and
hopefully they will have a fix soon.

Again my apologies for jumping the gun.


On Fri, Dec 6, 2013 at 11:23 AM, Ken Nielson <knielson at adaptivecomputing.com> wrote:

> It is nice to have confirmation of what we found about
> pthread_attr_setstacksize. As you said, the default stack size of a new
> thread is the value of ulimit -s * 1000. Currently the TORQUE code enforces
> a minimum stack size but does not check for a maximum, so if your stack
> limit is set to 300,000 you will get 300,000,000 bytes allocated for each
> thread. On the mom there are two threads, so 600 MB is allocated. On the
> server it gets huge.
>
> The bug in libc is that pthread_attr_setstacksize() is ignored. My
> development box is running libc version 2.15 and the bug is still there.
> However, I installed libc 2.18 on a CentOS 6 box and the bug appears to be
> fixed.  I am going to modify the MOM code to also set a maximum stack size.
>
> The TORQUE fix will show up in 4.2.7.
>
> Regards
>
>
> On Fri, Dec 6, 2013 at 10:52 AM, Steven Lo <slo at cacr.caltech.edu> wrote:
>
>>
>> Hi David,
>>
>> The nodes which we observed are running the following version:
>>
>> -bash-3.2# ldd /opt/torque/sbin/pbs_mom | grep libc.so
>>     libc.so.6 => /lib64/libc.so.6 (0x00002b18eae2a000)
>>
>> -bash-3.2# ldd --version
>> ldd (GNU libc) 2.5
>>
>>
>> -bash-3.2# qstat --version
>> Version: 4.1.5.1
>> Revision:
>>
>> -bash-3.2# uname -a
>> Linux zwicky005 2.6.18-308.1.1.el5 #1 SMP Fri Feb 17 16:51:01 EST 2012
>> x86_64 x86_64 x86_64 GNU/Linux
>>
>>
>>
>> We see that it's using ~3 GB of memory:
>>
>> -bash-3.2# top -p 16695
>>
>> top - 09:46:45 up 81 days,  1:01,  1 user,  load average: 9.19, 9.17, 9.11
>> Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
>> Cpu(s): 74.6%us,  0.7%sy,  0.0%ni, 24.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>> Mem:  24675856k total, 24286304k used,   389552k free,   497860k buffers
>> Swap: 49150856k total,  4750564k used, 44400292k free, 10798448k cached
>>
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>> 16695 root      15   0 3195m 3.1g 7052 S  0.3 13.1  77:50.71 pbs_mom
>>
>>
>> We came across this posting and are not sure whether it is relevant:
>>
>> http://comments.gmane.org/gmane.comp.clustering.torque.user/13557
>>
>>
>> Thanks for looking into this.
>>
>> Steven.
>>
>>
>> On 12/06/2013 09:04 AM, David Beer wrote:
>>
>> The issue is that in some versions of libc, the pthread stack size
>> defaults to 1000 * <the value set in ulimit -s>, even though TORQUE
>> specifies what stack size each thread should have. I will work to get a
>> list of the libc versions that have this bug. Ken is the one who
>> discovered this defect, so I'll ask him for the info or ask him to post
>> it.
>>
>>
>> On Fri, Dec 6, 2013 at 9:02 AM, Gus Correa <gus at ldeo.columbia.edu> wrote:
>>
>>> David
>>>
>>> For the benefit of all Torque users,
>>> could you please disclose all combinations of libc versions
>>> and Torque versions that have this problem?
>>>
>>> Thank you,
>>> Gus Correa
>>>
>>> On 12/05/2013 08:52 PM, David Beer wrote:
>>> > Steven,
>>> >
>>> > What OS and version of the pthread library (libc) do you have? We know
>>> > of a rather large memory leak related to certain versions of these
>>> > libraries.
>>> >
>>> >
>>> > On Thu, Dec 5, 2013 at 12:01 PM, Steven Lo <slo at cacr.caltech.edu> wrote:
>>> >
>>> >
>>> >     Hi,
>>> >
>>> >     We've discovered that pbs_mom on most nodes is using over 3 GB of
>>> >     memory.
>>> >     Is there a known memory leak issue in version 4.1.5.1?  If so, is
>>> >     there a patch for it, or do we have to upgrade to another version
>>> >     such as 4.1.7 or 4.2.6.1?
>>> >
>>> >     Thanks in advance for your suggestion.
>>> >
>>> >     Steven.
>>> >
>>> >     _______________________________________________
>>> >     torqueusers mailing list
>>> >     torqueusers at supercluster.org
>>> >     http://www.supercluster.org/mailman/listinfo/torqueusers
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > David Beer | Senior Software Engineer
>>> > Adaptive Computing
>>> >
>>> >
>>>
>>>
>>
>>
>>
>>  --
>> David Beer | Senior Software Engineer
>> Adaptive Computing
>>
>>
>>
>>
>>
>>
>
>
> --
> Ken Nielson
> +1 801.717.3700 office +1 801.717.3738 fax
> 1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
> www.adaptivecomputing.com
>
>


-- 
Ken Nielson
+1 801.717.3700 office +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
www.adaptivecomputing.com