[torqueusers] 4.1.5.1 memory leak

Steven Lo slo at cacr.caltech.edu
Tue Dec 10 14:02:36 MST 2013


Hi David,

Thanks for the info.  Very helpful.

It would be nice if we could get a fix for the 4.1 branch, since that
would provide a quick fix for our cluster (some jobs are failing due to
lack of memory) and give us a little breathing room before the upgrade.
I presume the fix will affect only the pbs_mom binary?

Again, thanks very much for your help.

Steven.


On 12/10/2013 12:42 PM, David Beer wrote:
> Oh, and to answer your other question it hasn't been checked in to the 
> 4.1-dev branch. If you need that let us know and we can probably port it.
>
>
> On Tue, Dec 10, 2013 at 1:41 PM, David Beer
> <dbeer at adaptivecomputing.com> wrote:
>
>     Steven,
>
>     4.2.7 is due out in mid-January. To clarify, "master" in git is
>     the equivalent of trunk in Subversion or Mercurial. When new major
>     releases are made, such as 4.5.0, they are branched from master.
>
>
>     On Tue, Dec 10, 2013 at 11:36 AM, Steven Lo
>     <slo at cacr.caltech.edu> wrote:
>
>
>         Hi Ken,
>
>         Not sure what "master" means.  Does it include 4.1.x?
>
>         If there is no fix for 4.1.x, our only option is to upgrade to
>         4.2.x or 4.5.0.
>
>         BTW, do you know when 4.2.7 will be out officially?
>
>         Thanks.
>
>         Steven.
>
>
>
>         On 12/10/2013 10:27 AM, Ken Nielson wrote:
>>         Steven,
>>
>>         We found the problem and have fixed it. You will find the fix
>>         on GitHub for 4.2-dev, 4.5.0, and master.
>>
>>         The fix was made on December 6 by me (knielson). The
>>         description is "Fixed a problem where the stack size was not
>>         getting set".
>>
>>         The commit hashes are as follows:
>>
>>         master 9c584930a20cd6d99b1690b9fd82747d794e021b
>>         4.2-dev f78ae1d290dd9279af8b4f20a74e188c61ed22d3
>>         4.5.0 12c8e5b6deeb8ea4af6f46767b21eca846663a42
>>
>>         You can create a patch from these if you need.
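>>
Creating a patch from one of those commits is standard git. The round trip looks like the sketch below; it runs in a throwaway repository here, since the TORQUE hashes above only resolve inside a TORQUE clone, where you would run e.g. `git format-patch -1 f78ae1d290dd9279af8b4f20a74e188c61ed22d3` for the 4.2-dev fix.

```shell
# Demonstrate exporting a commit as a patch file (scratch repo only;
# substitute the real TORQUE repo and commit hash in practice).
repo=$(mktemp -d)
cd "$repo"
git init -q .
echo "stack fix" > mom.c
git add mom.c
git -c user.name=demo -c user.email=demo@example.com \
    commit -qm "Fixed a problem where the stack size was not getting set"

# format-patch prints the name of the patch file it wrote.
patch_file=$(git format-patch -1 HEAD)
echo "created: $patch_file"

# The patch can then be applied to another source tree with
#   patch -p1 < "$patch_file"    (or: git am "$patch_file")
```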
>>
>>         Let me know if you need anything else.
>>
>>         Regards
>>
>>
>>
>>
>>         On Mon, Dec 9, 2013 at 12:50 PM, Steven Lo
>>         <slo at cacr.caltech.edu> wrote:
>>
>>
>>             We are using libc 2.5.
>>
>>             The stack size for our host is set to 1600000 KB.  If
>>             there are two threads, will pbs_mom then use ~3 GB?
>>
>>             stack size              (kbytes, -s) 1600000
>>
>>
>>             Is there a workaround for this problem?  If not, is an
>>             upgrade to 4.2.6.1 or 4.2.7 recommended?
>>
>>             Thanks.
>>
>>             Steven.
>>
>>
>>
>>             On 12/06/2013 10:55 AM, Ken Nielson wrote:
>>>             My apologies.  I have been doing more testing, and it
>>>             appears libc 2.18 still has the bug.  I will be
>>>             reporting this to the GNU libc maintainers, and
>>>             hopefully they will have a fix soon.
>>>
>>>             Again my apologies for jumping the gun.
>>>
>>>
>>>             On Fri, Dec 6, 2013 at 11:23 AM, Ken Nielson
>>>             <knielson at adaptivecomputing.com> wrote:
>>>
>>>                 It is nice to have confirmation of what we found
>>>                 about pthread_attr_setstacksize. As you said, the
>>>                 default stack size of a new thread is the value of
>>>                 ulimit -s * 1000. Currently the TORQUE code tries
>>>                 to enforce a minimum stack size but does not check
>>>                 for a maximum. So if you have your stack size set to
>>>                 300,000 you will get 300,000,000 bytes allocated for
>>>                 each thread. On the mom there are two threads, so
>>>                 you get 600 MB allocated. On the server it gets huge.
>>>
>>>                 The bug in libc is that the value passed to
>>>                 pthread_attr_setstacksize is ignored. My development
>>>                 box is running libc version 2.15 and the bug is
>>>                 still there. However, I installed libc 2.18 on a
>>>                 CentOS 6 box and there the bug appears to be fixed.
>>>                 I am going to modify the MOM code to also set a
>>>                 maximum stack size.
>>>
>>>                 The TORQUE fix will show up in 4.2.7.
>>>
>>>                 Regards
>>>
>>>
>>>                 On Fri, Dec 6, 2013 at 10:52 AM, Steven Lo
>>>                 <slo at cacr.caltech.edu> wrote:
>>>
>>>
>>>                     Hi David,
>>>
>>>                     The nodes which we observed are running the
>>>                     following version:
>>>
>>>                     -bash-3.2# ldd /opt/torque/sbin/pbs_mom | grep libc.so
>>>                         libc.so.6 => /lib64/libc.so.6 (0x00002b18eae2a000)
>>>
>>>                     -bash-3.2# ldd --version
>>>                     ldd (GNU libc) 2.5
>>>
>>>
>>>                     -bash-3.2# qstat --version
>>>                     Version: 4.1.5.1
>>>                     Revision:
>>>
>>>                     -bash-3.2# uname -a
>>>                     Linux zwicky005 2.6.18-308.1.1.el5 #1 SMP Fri
>>>                     Feb 17 16:51:01 EST 2012 x86_64 x86_64 x86_64
>>>                     GNU/Linux
>>>
>>>
>>>
>>>                     We see that it's using ~3 GB of memory:
>>>
>>>                     -bash-3.2# top -p 16695
>>>
>>>                     top - 09:46:45 up 81 days, 1:01, 1 user, load average: 9.19, 9.17, 9.11
>>>                     Tasks:  1 total,  0 running,  1 sleeping,  0 stopped,  0 zombie
>>>                     Cpu(s): 74.6%us, 0.7%sy, 0.0%ni, 24.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
>>>                     Mem:  24675856k total, 24286304k used,   389552k free,   497860k buffers
>>>                     Swap: 49150856k total,  4750564k used, 44400292k free, 10798448k cached
>>>
>>>                       PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
>>>                     16695 root  15  0 3195m 3.1g 7052 S  0.3 13.1 77:50.71 pbs_mom
>>>
>>>
>>>                     We came across this posting and are not sure
>>>                     whether it is relevant:
>>>
>>>                     http://comments.gmane.org/gmane.comp.clustering.torque.user/13557
>>>
>>>
>>>                     Thanks for looking into this.
>>>
>>>                     Steven.
>>>
>>>
>>>                     On 12/06/2013 09:04 AM, David Beer wrote:
>>>>                     The issue is that in some versions of libc, the
>>>>                     pthread stack size will default to 1000 * <the
>>>>                     value set in ulimit -s>, even though TORQUE
>>>>                     specifies what stack size each thread should
>>>>                     have. I will work to get a list of the versions
>>>>                     of libc that have this bug. Ken is the one who
>>>>                     discovered this defect, so I'll ask him for the
>>>>                     info or ask him to post it.
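>>>>
Plugging in the numbers from this thread, that default accounts for the roughly 3 GB resident size reported for pbs_mom. A quick sanity check of the arithmetic, using the 1600000 KB `ulimit -s` value Steven quotes:

```shell
# Per-thread stack under the buggy libc: (ulimit -s in KB) * 1000 bytes.
stack_kb=1600000                 # Steven's reported `ulimit -s` value
per_thread=$((stack_kb * 1000))  # bytes allocated per thread
total=$((per_thread * 2))        # pbs_mom runs two threads

echo "per thread: ${per_thread} bytes"
echo "total:      ${total} bytes"   # ~3.2 GB, matching the ~3.1g RES in top
```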
>>>>
>>>>
>>>>                     On Fri, Dec 6, 2013 at 9:02 AM, Gus Correa
>>>>                     <gus at ldeo.columbia.edu> wrote:
>>>>
>>>>                         David
>>>>
>>>>                         For the benefit of all Torque users,
>>>>                         could you please disclose all combinations
>>>>                         of libc versions
>>>>                         and Torque versions that have this problem?
>>>>
>>>>                         Thank you,
>>>>                         Gus Correa
>>>>
>>>>                         On 12/05/2013 08:52 PM, David Beer wrote:
>>>>                         > Steven,
>>>>                         >
>>>>                         > What OS and version of the pthread
>>>>                         > library (libc) do you have? We know of a
>>>>                         > rather large memory leak related to
>>>>                         > certain versions of these libraries.
>>>>                         >
>>>>                         >
>>>>                         > On Thu, Dec 5, 2013 at 12:01 PM, Steven
>>>>                         > Lo <slo at cacr.caltech.edu> wrote:
>>>>                         >
>>>>                         >
>>>>                         >     Hi,
>>>>                         >
>>>>                         >     We've discovered that pbs_mom on
>>>>                         >     most nodes is using over 3 GB of
>>>>                         >     memory.  Is there a known memory
>>>>                         >     leak issue in version 4.1.5.1?  If
>>>>                         >     so, is there a patch for it, or do
>>>>                         >     we have to upgrade to another
>>>>                         >     version, such as 4.1.7 or 4.2.6.1?
>>>>                         >
>>>>                         > Thanks in advance for your suggestion.
>>>>                         >
>>>>                         > Steven.
>>>>                         >
>>>>                         >
>>>>                         _______________________________________________
>>>>                         > torqueusers mailing list
>>>>                         > torqueusers at supercluster.org
>>>>                         > http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>                         >
>>>>                         >
>>>>                         >
>>>>                         >
>>>>                         > --
>>>>                         > David Beer | Senior Software Engineer
>>>>                         > Adaptive Computing
>>>>                         >
>>>>                         >
>>>>                         >
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>                     -- 
>>>>                     David Beer | Senior Software Engineer
>>>>                     Adaptive Computing
>>>>
>>>>
>>>>                     _______________________________________________
>>>>                     torqueusers mailing list
>>>>                     torqueusers at supercluster.org  <mailto:torqueusers at supercluster.org>
>>>>                     http://www.supercluster.org/mailman/listinfo/torqueusers
>>>
>>>
>>>
>>>
>>>
>>>
>>>                 -- 
>>>                 Ken Nielson
>>>                 +1 801.717.3700 <tel:%2B1%20801.717.3700> office +1
>>>                 801.717.3738 <tel:%2B1%20801.717.3738> fax
>>>                 1712 S. East Bay Blvd, Suite 300 Provo, UT  84606
>>>                 www.adaptivecomputing.com
>>>                 <http://www.adaptivecomputing.com>
>>>
>>>
>>>
>>>
>>>             -- 
>>>             Ken Nielson
>>>             +1 801.717.3700 <tel:%2B1%20801.717.3700> office +1
>>>             801.717.3738 <tel:%2B1%20801.717.3738> fax
>>>             1712 S. East Bay Blvd, Suite 300  Provo, UT 84606
>>>             www.adaptivecomputing.com <http://www.adaptivecomputing.com>
>>>
>>>
>>>
>>
>>
>>
>>
>>
>>
>>         -- 
>>         Ken Nielson
>>         +1 801.717.3700 <tel:%2B1%20801.717.3700> office +1
>>         801.717.3738 <tel:%2B1%20801.717.3738> fax
>>         1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
>>         www.adaptivecomputing.com <http://www.adaptivecomputing.com>
>>
>>
>>
>>         _______________________________________________
>>         torqueusers mailing list
>>         torqueusers at supercluster.org  <mailto:torqueusers at supercluster.org>
>>         http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>
>
>
>
>     -- 
>     David Beer | Senior Software Engineer
>     Adaptive Computing
>
>
>
>
> -- 
> David Beer | Senior Software Engineer
> Adaptive Computing
>
>
