[torqueusers] 4.1.5.1 memory leak

Steven Lo slo at cacr.caltech.edu
Mon Dec 9 12:50:03 MST 2013


We are using libc 2.5.

The stack size for our host is set to 1600000 KB.  If pbs_mom runs 2
threads, will it then use ~3 GB?

stack size              (kbytes, -s) 1600000
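
As a rough check of that estimate, here is a minimal Python sketch of the arithmetic (it is not part of TORQUE). It assumes each thread really is given the full `ulimit -s` value as its stack, and that the value is in units of 1024 bytes, which is how bash reports it. The function names are made up for illustration.

```python
import resource

KIB = 1024  # `ulimit -s` reports the soft stack limit in units of 1024 bytes


def estimate_thread_stack_bytes(stack_limit_kib, thread_count):
    """Bytes of stack reserved if every thread gets the full limit (assumption)."""
    return stack_limit_kib * KIB * thread_count


def current_stack_limit_kib():
    """Soft RLIMIT_STACK of this process in KiB, or None when it is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
    if soft == resource.RLIM_INFINITY:
        return None
    return soft // KIB


if __name__ == "__main__":
    # Values from this message: 1600000 KB limit, 2 threads.
    total = estimate_thread_stack_bytes(1600000, 2)
    print(f"{total} bytes = {total / 2**20:.0f} MiB")
    print("stack limit of this shell (KiB):", current_stack_limit_kib())
```

With 1600000 KB and 2 threads this gives 3276800000 bytes, i.e. 3125 MiB, which is in the same range as the 3195m VIRT shown in the top output quoted further down.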


Is there a workaround for this problem?  If not, is upgrading to 4.2.6.1
or 4.2.7 recommended?

Thanks.

Steven.
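
For reference, the missing maximum check described in the quoted messages below can be sketched roughly as follows. This is a hedged illustration in Python, not the TORQUE code: the floor and ceiling values and both function names are hypothetical, and `threading.stack_size` stands in for pthread_attr_setstacksize (which, as far as I know, CPython uses underneath on POSIX systems). Per the thread, affected libc versions reportedly ignore the requested size, so a clamp like this would only help where the setting is honored.

```python
import threading

# Hypothetical bounds for illustration only; these are not TORQUE's values.
MIN_STACK_BYTES = 1 * 1024 * 1024
MAX_STACK_BYTES = 8 * 1024 * 1024


def clamp_stack_size(requested_bytes, floor=MIN_STACK_BYTES, ceiling=MAX_STACK_BYTES):
    """Apply both a minimum and a maximum to a requested stack size."""
    return max(floor, min(requested_bytes, ceiling))


def run_with_stack_size(func, requested_bytes):
    """Run func in a new thread whose stack size is the clamped request."""
    size = clamp_stack_size(requested_bytes)
    previous = threading.stack_size()   # remember the current setting (0 = default)
    threading.stack_size(size)          # applies to threads created after this call
    try:
        worker = threading.Thread(target=func)
        worker.start()
        worker.join()
    finally:
        threading.stack_size(previous)  # restore the earlier setting
    return size
```

For example, `run_with_stack_size(some_function, 1600000 * 1024)` would start the thread with the hypothetical 8 MiB ceiling instead of the full 1600000 KB limit.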


On 12/06/2013 10:55 AM, Ken Nielson wrote:
> My apologies.  I have been doing more testing and it appears libc 2.18 
> still has the bug.  I will be reporting this to GNU, and hopefully
> they will have a fix soon.
>
> Again my apologies for jumping the gun.
>
>
> On Fri, Dec 6, 2013 at 11:23 AM, Ken Nielson
> <knielson at adaptivecomputing.com> wrote:
>
>     It is nice to have confirmation of what we found about
>     pthread_attr_setstacksize. As you said, the default stack size of
>     a new thread is the value of ulimit -s * 1000. Currently the
>     TORQUE code enforces a minimum stack size but does not check
>     for a maximum. So if you have your stack size set to 300,000
>     you will get 300,000,000 bytes allocated for each thread. On the
>     mom there are two threads, so you get about 600 MB allocated. On
>     the server it gets huge.
>
>     The bug in libc is that the pthread_attr_setstacksize setting is
>     ignored.
>     My development box is running libc version 2.15 and the bug is
>     still there. However, I installed libc 2.18 on a CentOS 6 box and
>     the bug appears to be fixed.  I am going to modify the MOM code to
>     also set a maximum stack size.
>
>     The TORQUE fix will show up in 4.2.7.
>
>     Regards
>
>
>     On Fri, Dec 6, 2013 at 10:52 AM, Steven Lo
>     <slo at cacr.caltech.edu> wrote:
>
>
>         Hi David,
>
>         The nodes which we observed are running the following version:
>
>         -bash-3.2# ldd /opt/torque/sbin/pbs_mom | grep libc.so
>             libc.so.6 => /lib64/libc.so.6 (0x00002b18eae2a000)
>
>         -bash-3.2# ldd --version
>         ldd (GNU libc) 2.5
>
>
>         -bash-3.2# qstat --version
>         Version: 4.1.5.1
>         Revision:
>
>         -bash-3.2# uname -a
>         Linux zwicky005 2.6.18-308.1.1.el5 #1 SMP Fri Feb 17 16:51:01 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
>
>
>
>         We see that it's using ~3G of memory:
>
>         -bash-3.2# top -p 16695
>
>         top - 09:46:45 up 81 days,  1:01,  1 user, load average: 9.19, 9.17, 9.11
>         Tasks:   1 total,   0 running,   1 sleeping, 0 stopped,   0 zombie
>         Cpu(s): 74.6%us,  0.7%sy,  0.0%ni, 24.6%id, 0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>         Mem:  24675856k total, 24286304k used, 389552k free,   497860k buffers
>         Swap: 49150856k total,  4750564k used, 44400292k free, 10798448k cached
>
>           PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>         16695 root      15   0 3195m 3.1g 7052 S  0.3 13.1  77:50.71 pbs_mom
>
>
>         We came across this posting and are not sure whether it is relevant:
>
>         http://comments.gmane.org/gmane.comp.clustering.torque.user/13557
>
>
>         Thanks for looking into this.
>
>         Steven.
>
>
>         On 12/06/2013 09:04 AM, David Beer wrote:
>>         The issue is that in some versions of libc, the pthread stack
>>         size will default to 1000 * <the value set in ulimit -s>,
>>         even though TORQUE specifies what stack size each thread
>>         should have. I will work to get a list of the versions of
>>         libc that have this bug. Ken is the one that discovered this
>>         defect, so I'll ask him for the info or ask him to post the info.
>>
>>
>>         On Fri, Dec 6, 2013 at 9:02 AM, Gus Correa
>>         <gus at ldeo.columbia.edu> wrote:
>>
>>             David
>>
>>             For the benefit of all Torque users,
>>             could you please disclose all combinations of libc versions
>>             and Torque versions that have this problem?
>>
>>             Thank you,
>>             Gus Correa
>>
>>             On 12/05/2013 08:52 PM, David Beer wrote:
>>             > Steven,
>>             >
>>             > What OS and version of the pthread library (libc) do you
>>             > have? We know of a rather large memory leak related to
>>             > different versions of these libraries.
>>             >
>>             >
>>             > On Thu, Dec 5, 2013 at 12:01 PM, Steven Lo
>>             > <slo at cacr.caltech.edu> wrote:
>>             >
>>             >
>>             >     Hi,
>>             >
>>             >     We've discovered that pbs_mom on most nodes is using
>>             >     over 3GB of memory.
>>             >     Is there a known memory leak issue for version 4.1.5.1?
>>             >     If so, is there a patch for it, or do we have to upgrade
>>             >     to another version like 4.1.7 or 4.2.6.1?
>>             >
>>             >     Thanks in advance for your suggestion.
>>             >
>>             >     Steven.
>>             >
>>             > _______________________________________________
>>             >     torqueusers mailing list
>>             >     torqueusers at supercluster.org
>>             >     http://www.supercluster.org/mailman/listinfo/torqueusers
>>             >
>>             >
>>             >
>>             >
>>             > --
>>             > David Beer | Senior Software Engineer
>>             > Adaptive Computing
>>             >
>>             >
>>
>>
>>
>>
>>
>>         -- 
>>         David Beer | Senior Software Engineer
>>         Adaptive Computing
>>
>>
>
>
>
>
>
>
>     -- 
>     Ken Nielson
>     +1 801.717.3700 office +1 801.717.3738 fax
>     1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
>     www.adaptivecomputing.com <http://www.adaptivecomputing.com>
>
>
>
>
> -- 
> Ken Nielson
> +1 801.717.3700 office +1 801.717.3738 fax
> 1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
> www.adaptivecomputing.com <http://www.adaptivecomputing.com>
>
>
>
