[torqueusers] 4.1.5.1 memory leak

Steven Lo slo at cacr.caltech.edu
Tue Dec 10 11:36:05 MST 2013


Hi Ken,

Not sure what "master" means.  Does it include 4.1.x?

If there is no fix for 4.1.x, our only option is to upgrade to 4.2.x or 
4.5.0.

BTW, do you know when 4.2.7 will be out officially?

Thanks.

Steven.


On 12/10/2013 10:27 AM, Ken Nielson wrote:
> Steven,
>
> We found the problem and have fixed it. You will find the fix in 
> GitHub for 4.2-dev, 4.5.0 and master.
>
> The fix was made on December 6 by me (knielson). The description is 
> "Fixed a problem where the stack size was not getting set"
>
> The commit hashes are as follows:
>
> master     9c584930a20cd6d99b1690b9fd82747d794e021b
> 4.2-dev    f78ae1d290dd9279af8b4f20a74e188c61ed22d3
> 4.5.0      12c8e5b6deeb8ea4af6f46767b21eca846663a42
>
> You can create a patch from these if you need to.
>
> Let me know if you need anything else.
>
> Regards
>
>
>
>
> On Mon, Dec 9, 2013 at 12:50 PM, Steven Lo <slo at cacr.caltech.edu> wrote:
>
>
>     We are using libc 2.5.
>
>     The stack size for our host is set to 1600000 KB.  If there are two
>     threads, will pbs_mom then use ~3 GB?
>
>     stack size              (kbytes, -s) 1600000
>
>
>     Is there a workaround to this problem?  If not, is upgrading to
>     4.2.6.1 or 4.2.7 recommended?
>
>     Thanks.
>
>     Steven.
>
>
>
>     On 12/06/2013 10:55 AM, Ken Nielson wrote:
>>     My apologies.  I have been doing more testing, and it appears libc
>>     2.18 still has the bug.  I will report this to GNU and
>>     hopefully they will have a fix soon.
>>
>>     Again my apologies for jumping the gun.
>>
>>
>>     On Fri, Dec 6, 2013 at 11:23 AM, Ken Nielson
>>     <knielson at adaptivecomputing.com> wrote:
>>
>>         It is nice to have confirmation of what we found about
>>         pthread_attr_setstacksize. As you said, the default stack
>>         size of a new thread is the value of ulimit -s * 1000.
>>         Currently the TORQUE code enforces a minimum stack size
>>         but does not check for a maximum. So if you have your
>>         stack size set to 300,000 you will get 300,000,000 bytes
>>         allocated for each thread. On the mom there are two threads,
>>         so you get 600 MB allocated. On the server it gets huge.
>>
>>         The bug in libc is that pthread_attr_setstacksize is
>>         ignored. My development box is running libc version 2.15 and
>>         the bug is still there. However, I installed libc 2.18 on a
>>         CentOS 6 box and the bug appears to be fixed.  I am going to
>>         modify the MOM code to also set a maximum stack size.
>>
>>         The TORQUE fix will show up in 4.2.7.
>>
>>         Regards
>>
>>
>>         On Fri, Dec 6, 2013 at 10:52 AM, Steven Lo
>>         <slo at cacr.caltech.edu> wrote:
>>
>>
>>             Hi David,
>>
>>             The nodes which we observed are running the following
>>             version:
>>
>>             -bash-3.2# ldd /opt/torque/sbin/pbs_mom | grep libc.so
>>                 libc.so.6 => /lib64/libc.so.6 (0x00002b18eae2a000)
>>
>>             -bash-3.2# ldd --version
>>             ldd (GNU libc) 2.5
>>
>>
>>             -bash-3.2# qstat --version
>>             Version: 4.1.5.1
>>             Revision:
>>
>>             -bash-3.2# uname -a
>>             Linux zwicky005 2.6.18-308.1.1.el5 #1 SMP Fri Feb 17
>>             16:51:01 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
>>
>>
>>
>>             We see that it's using ~3G of memory:
>>
>>             -bash-3.2# top -p 16695
>>
>>             top - 09:46:45 up 81 days, 1:01,  1 user,  load average:
>>             9.19, 9.17, 9.11
>>             Tasks:   1 total,   0 running, 1 sleeping,   0 stopped,  
>>             0 zombie
>>             Cpu(s): 74.6%us,  0.7%sy, 0.0%ni, 24.6%id,  0.0%wa,
>>             0.0%hi,  0.0%si,  0.0%st
>>             Mem:  24675856k total, 24286304k used,   389552k free,  
>>             497860k buffers
>>             Swap: 49150856k total,  4750564k used, 44400292k free,
>>             10798448k cached
>>
>>               PID USER      PR  NI  VIRT   RES   SHR S %CPU %MEM     TIME+ COMMAND
>>             16695 root      15   0 3195m  3.1g  7052 S  0.3 13.1  77:50.71 pbs_mom
>>
>>
>>             We came across this posting and are not sure whether it is relevant:
>>
>>             http://comments.gmane.org/gmane.comp.clustering.torque.user/13557
>>
>>
>>             Thanks for looking into this.
>>
>>             Steven.
>>
>>
>>             On 12/06/2013 09:04 AM, David Beer wrote:
>>>             The issue is that in some versions of libc, the pthread
>>>             stack size will default to 1000 * <the value set in
>>>             ulimit -s>, even though TORQUE specifies what stack size
>>>             each thread should have. I will work to get a list of
>>>             the versions of libc that have this bug. Ken discovered
>>>             this defect, so I'll ask him for the info or have him
>>>             post it.
>>>
>>>
>>>             On Fri, Dec 6, 2013 at 9:02 AM, Gus Correa
>>>             <gus at ldeo.columbia.edu> wrote:
>>>
>>>                 David
>>>
>>>                 For the benefit of all Torque users,
>>>                 could you please disclose all combinations of libc
>>>                 versions
>>>                 and Torque versions that have this problem?
>>>
>>>                 Thank you,
>>>                 Gus Correa
>>>
>>>                 On 12/05/2013 08:52 PM, David Beer wrote:
>>>                 > Steven,
>>>                 >
>>>                 > What OS and version of the pthread library (libc)
>>>                 > do you have? We know of a rather large memory leak
>>>                 > related to different versions of these libraries.
>>>                 >
>>>                 >
>>>                 > On Thu, Dec 5, 2013 at 12:01 PM, Steven Lo
>>>                 > <slo at cacr.caltech.edu> wrote:
>>>                 >
>>>                 >
>>>                 >     Hi,
>>>                 >
>>>                 >     We've discovered that pbs_mom on most nodes
>>>                 >     is using over 3GB of memory.
>>>                 >     Is there a known memory leak issue for version
>>>                 >     4.1.5.1?  If so, is there a patch for it, or do
>>>                 >     we have to upgrade to another version like
>>>                 >     4.1.7 or 4.2.6.1?
>>>                 >
>>>                 >     Thanks in advance for your suggestion.
>>>                 >
>>>                 >     Steven.
>>>                 >
>>>                 > _______________________________________________
>>>                 > torqueusers mailing list
>>>                 > torqueusers at supercluster.org
>>>                 > http://www.supercluster.org/mailman/listinfo/torqueusers
>>>                 >
>>>                 >
>>>                 >
>>>                 >
>>>                 > --
>>>                 > David Beer | Senior Software Engineer
>>>                 > Adaptive Computing
>>>                 >
>>>                 >
>>>
>>>
>>>
>>>
>>>
>>>             -- 
>>>             David Beer | Senior Software Engineer
>>>             Adaptive Computing
>>>
>>>
>>
>>
>>
>>
>>
>>
>>         -- 
>>         Ken Nielson
>>         +1 801.717.3700 <tel:%2B1%20801.717.3700> office +1
>>         801.717.3738 <tel:%2B1%20801.717.3738> fax
>>         1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
>>         www.adaptivecomputing.com <http://www.adaptivecomputing.com>
>>
>>
>>
>>
>>     -- 
>>     Ken Nielson
>>     +1 801.717.3700 <tel:%2B1%20801.717.3700> office +1 801.717.3738
>>     <tel:%2B1%20801.717.3738> fax
>>     1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
>>     www.adaptivecomputing.com <http://www.adaptivecomputing.com>
>>
>>
>>
>
>
>
>
>
>
>
>
>


