[torqueusers] 4.1.5.1 memory leak

Steven Lo slo at cacr.caltech.edu
Thu Dec 12 17:06:04 MST 2013


Hi David,

Thank you very much for the fix.

Steven.


On 12/11/2013 09:25 AM, David Beer wrote:
> I cherry-picked the commit into 4.1-dev. I only had to help it along a 
> little.
>
> 8b0db74
>
>
> On Tue, Dec 10, 2013 at 2:02 PM, Steven Lo <slo at cacr.caltech.edu> wrote:
>
>
>     Hi David,
>
>     Thanks for the info.  Very helpful.
>
>     It would be nice if we could get a fix for the 4.1 branch, since it
>     would provide a quick fix for our cluster (some jobs are failing due
>     to lack of memory). That would also give us a little breathing room
>     for the upgrade.  I presume the fix will affect the pbs_mom binary
>     only?
>
>     Again, thanks much for your help.
>
>     Steven.
>
>
>
>     On 12/10/2013 12:42 PM, David Beer wrote:
>>     Oh, and to answer your other question it hasn't been checked in
>>     to the 4.1-dev branch. If you need that let us know and we can
>>     probably port it.
>>
>>
>>     On Tue, Dec 10, 2013 at 1:41 PM, David Beer
>>     <dbeer at adaptivecomputing.com> wrote:
>>
>>         Steven,
>>
>>         4.2.7 is due out in mid-January. To clarify, "master" in git
>>         is the equivalent of trunk in subversion or mercurial. When
>>         new major releases are made, such as 4.5.0, they are branched
>>         from master.
>>
>>
>>         On Tue, Dec 10, 2013 at 11:36 AM, Steven Lo
>>         <slo at cacr.caltech.edu> wrote:
>>
>>
>>             Hi Ken,
>>
>>             Not sure what "master" means. Does it include 4.1.x?
>>
>>             If there is no fix for 4.1.x, our only option is to
>>             upgrade to 4.2.x or 4.5.0.
>>
>>             BTW, do you know when 4.2.7 will be out officially?
>>
>>             Thanks.
>>
>>             Steven.
>>
>>
>>
>>             On 12/10/2013 10:27 AM, Ken Nielson wrote:
>>>             Steven,
>>>
>>>             We found the problem and have fixed it. You will find
>>>             the fix in Github for 4.2-dev, 4.5.0 and master.
>>>
>>>             The fix was made on December 6 by me (knielson). The
>>>             description is "Fixed a problem where the stack size was
>>>             not getting set"
>>>
>>>             The commit hashes are as follows:
>>>
>>>             master 9c584930a20cd6d99b1690b9fd82747d794e021b
>>>             4.2-dev f78ae1d290dd9279af8b4f20a74e188c61ed22d3
>>>             4.5.0 12c8e5b6deeb8ea4af6f46767b21eca846663a42
>>>
>>>             You can create a patch from these if you need.
>>>
>>>             Let me know if you need anything else.
>>>
>>>             Regards
>>>
>>>
>>>
>>>
>>>             On Mon, Dec 9, 2013 at 12:50 PM, Steven Lo
>>>             <slo at cacr.caltech.edu> wrote:
>>>
>>>
>>>                 We are using libc 2.5.
>>>
>>>                 The stack size for our host is set to 1600000 KB.
>>>                 If there are 2 threads, will pbs_mom then use ~3GB?
>>>
>>>                 stack size (kbytes, -s) 1600000
>>>
>>>
>>>                 Is there a workaround for this problem?  If not, is an
>>>                 upgrade to 4.2.6.1 or 4.2.7 recommended?
>>>
>>>                 Thanks.
>>>
>>>                 Steven.
>>>
>>>
>>>
>>>                 On 12/06/2013 10:55 AM, Ken Nielson wrote:
>>>>                 My apologies.  I have been doing more testing and
>>>>                 it appears libc 2.18 still has the bug.  I will be
>>>>                 reporting this to GNU and hopefully they will have
>>>>                 a fix soon.
>>>>
>>>>                 Again my apologies for jumping the gun.
>>>>
>>>>
>>>>                 On Fri, Dec 6, 2013 at 11:23 AM, Ken Nielson
>>>>                 <knielson at adaptivecomputing.com> wrote:
>>>>
>>>>                     It is nice to have confirmation on what we
>>>>                     found about the pthread_attr_setstacksize. As
>>>>                     you said, the default stack size of a new
>>>>                     thread is the value of ulimit -s * 1000.
>>>>                     Currently the TORQUE code tries to enforce a
>>>>                     minimum stack size but it does not check for a
>>>>                     maximum. So if you have your stack size set to
>>>>                     300,000 you will get 300,000,000 bytes
>>>>                     allocated for each thread. On the mom there are
>>>>                     two threads so you get 600 MB allocated. On the
>>>>                     server it gets huge.
>>>>
>>>>                     The bug in libc is that the call to
>>>>                     pthread_attr_setstacksize is ignored. My
>>>>                     development box is running libc version 2.15
>>>>                     and the bug is still there. However, I
>>>>                     installed libc 2.18 on a CentOS 6 box and the
>>>>                     bug appears to be fixed.  I am going to modify
>>>>                     the MOM code to also set a maximum stack size.
>>>>
>>>>                     The TORQUE fix will show up in 4.2.7.
>>>>
>>>>                     Regards
>>>>
>>>>
>>>>                     On Fri, Dec 6, 2013 at 10:52 AM, Steven Lo
>>>>                     <slo at cacr.caltech.edu> wrote:
>>>>
>>>>
>>>>                         Hi David,
>>>>
>>>>                         The nodes which we observed are running the
>>>>                         following version:
>>>>
>>>>                         -bash-3.2# ldd /opt/torque/sbin/pbs_mom | grep libc.so
>>>>                             libc.so.6 => /lib64/libc.so.6 (0x00002b18eae2a000)
>>>>
>>>>                         -bash-3.2# ldd --version
>>>>                         ldd (GNU libc) 2.5
>>>>
>>>>
>>>>                         -bash-3.2# qstat --version
>>>>                         Version: 4.1.5.1
>>>>                         Revision:
>>>>
>>>>                         -bash-3.2# uname -a
>>>>                         Linux zwicky005 2.6.18-308.1.1.el5 #1 SMP Fri Feb 17 16:51:01 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
>>>>
>>>>
>>>>
>>>>                         We see that it's using ~3G of memory:
>>>>
>>>>                         -bash-3.2# top -p 16695
>>>>
>>>>                         top - 09:46:45 up 81 days, 1:01,  1 user,  load average: 9.19, 9.17, 9.11
>>>>                         Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
>>>>                         Cpu(s): 74.6%us, 0.7%sy, 0.0%ni, 24.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
>>>>                         Mem: 24675856k total, 24286304k used, 389552k free, 497860k buffers
>>>>                         Swap: 49150856k total, 4750564k used, 44400292k free, 10798448k cached
>>>>
>>>>                           PID USER      PR NI  VIRT  RES SHR S %CPU %MEM    TIME+ COMMAND
>>>>                         16695 root      15 0 3195m 3.1g 7052 S  0.3 13.1  77:50.71 pbs_mom
>>>>
>>>>
>>>>                         We came across this posting and are not sure
>>>>                         whether it is relevant:
>>>>
>>>>                         http://comments.gmane.org/gmane.comp.clustering.torque.user/13557
>>>>
>>>>
>>>>                         Thanks for looking into this.
>>>>
>>>>                         Steven.
>>>>
>>>>
>>>>                         On 12/06/2013 09:04 AM, David Beer wrote:
>>>>>                         The issue is that in some versions of
>>>>>                         libc, the pthread stack size will default
>>>>>                         to 1000 * <the value set in ulimit -s>,
>>>>>                         even though TORQUE specifies what stack
>>>>>                         size each thread should have. I will work
>>>>>                         to get a list of the versions of libc that
>>>>>                         have this bug. Ken is the one that
>>>>>                         discovered this defect, so I'll ask him
>>>>>                         for the info or ask him to post the info.
>>>>>
>>>>>
>>>>>                         On Fri, Dec 6, 2013 at 9:02 AM, Gus Correa
>>>>>                         <gus at ldeo.columbia.edu> wrote:
>>>>>
>>>>>                             David
>>>>>
>>>>>                             For the benefit of all Torque users,
>>>>>                             could you please disclose all
>>>>>                             combinations of libc versions
>>>>>                             and Torque versions that have this
>>>>>                             problem?
>>>>>
>>>>>                             Thank you,
>>>>>                             Gus Correa
>>>>>
>>>>>                             On 12/05/2013 08:52 PM, David Beer wrote:
>>>>>                             > Steven,
>>>>>                             >
>>>>>                             > What OS and version of the pthread library (libc) do you have?
>>>>>                             > We know of a rather large memory leak related to different
>>>>>                             > versions of these libraries.
>>>>>                             >
>>>>>                             >
>>>>>                             > On Thu, Dec 5, 2013 at 12:01 PM, Steven Lo
>>>>>                             > <slo at cacr.caltech.edu> wrote:
>>>>>                             >
>>>>>                             >
>>>>>                             >     Hi,
>>>>>                             >
>>>>>                             >     We've discovered that pbs_mom on most nodes is using over
>>>>>                             >     3GB of memory. Is there a known memory leak issue for
>>>>>                             >     version 4.1.5.1?  If so, is there a patch for it, or do we
>>>>>                             >     have to upgrade to another version like 4.1.7 or 4.2.6.1?
>>>>>                             >
>>>>>                             >     Thanks in advance for your suggestion.
>>>>>                             >
>>>>>                             >     Steven.
>>>>>                             >
>>>>>                             >
>>>>>                             > _______________________________________________
>>>>>                             > torqueusers mailing list
>>>>>                             > torqueusers at supercluster.org
>>>>>                             > http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>                             >
>>>>>                             >
>>>>>                             >
>>>>>                             >
>>>>>                             > --
>>>>>                             > David Beer | Senior Software Engineer
>>>>>                             > Adaptive Computing
>>>>>                             >
>>>>                     -- 
>>>>                     Ken Nielson
>>>>                     +1 801.717.3700 office +1 801.717.3738 fax
>>>>                     1712 S. East Bay Blvd, Suite 300 Provo, UT 84606
>>>>                     www.adaptivecomputing.com

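The arithmetic in Ken's Dec 6 message can be checked with a short sketch.
It assumes the model described in the thread: each thread's default stack
is the `ulimit -s` value times 1000, and pbs_mom runs two threads. The
function names are hypothetical, and the `unit` parameter is there because
bash reports `ulimit -s` in 1024-byte units, so the exact figures may come
out slightly higher than the thread's round numbers.

```python
def default_stack_bytes(ulimit_s_kb, unit=1000):
    """Per-thread default stack size implied by `ulimit -s` (thread's model)."""
    return ulimit_s_kb * unit


def total_stack_bytes(ulimit_s_kb, threads, unit=1000):
    """Stack memory for `threads` threads that all take the default size."""
    return default_stack_bytes(ulimit_s_kb, unit) * threads


if __name__ == "__main__":
    # Values taken from the thread: Ken's 300,000 example and the
    # 1600000 KB limit reported on the affected nodes, two threads each.
    cases = [("ulimit -s 300000", 300_000, 2), ("ulimit -s 1600000", 1_600_000, 2)]
    for label, kb, threads in cases:
        total = total_stack_bytes(kb, threads)
        print(f"{label}: {total:,} bytes (~{total / 1e9:.1f} GB)")
```

With these inputs the sketch gives 600,000,000 bytes for Ken's example and
3,200,000,000 bytes for the 1600000 KB limit, which is in line with the
~3G that top reports for pbs_mom.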

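The fix discussed in the thread adds a maximum to the existing minimum
stack-size check. The actual TORQUE change is C code and is not shown in
the thread, so the following is only a minimal sketch of the idea in
Python, where `threading.stack_size` stands in for
`pthread_attr_setstacksize`. The bounds, the page size, and the function
names are assumptions chosen for illustration, not TORQUE's values.

```python
import resource
import threading

# Hypothetical bounds, chosen only for illustration (not TORQUE's values).
MIN_STACK_BYTES = 1 * 1024 * 1024
MAX_STACK_BYTES = 8 * 1024 * 1024
PAGE = 4096  # assumed page size; keeps the request page-aligned


def clamp_stack_size(requested, minimum=MIN_STACK_BYTES, maximum=MAX_STACK_BYTES):
    """Bound a requested stack size from below and from above."""
    bounded = max(minimum, min(requested, maximum))
    return bounded - (bounded % PAGE)


def start_worker(target):
    """Start a thread whose stack size is clamped rather than inherited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)  # value in bytes
    requested = MAX_STACK_BYTES if soft == resource.RLIM_INFINITY else soft
    size = clamp_stack_size(requested)
    previous = threading.stack_size(size)  # applies to threads created next
    try:
        worker = threading.Thread(target=target)
        worker.start()
    finally:
        threading.stack_size(previous)  # restore the earlier setting
    return worker, size
```

On a host with a very large `ulimit -s`, such as the 1600000 KB setting
mentioned in the thread, `start_worker` would request the ceiling rather
than the full limit for each thread.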