[torqueusers] 4.1.5.1 memory leak

Steven Lo slo at cacr.caltech.edu
Thu Dec 19 10:49:35 MST 2013


Hi Ken,

Thanks for the info.

We are planning to upgrade to the official release of either 4.2.7 or 
4.5.0 when it is ready.

BTW, we did test the patch you and David provided and it works as 
expected.  Thanks a bunch.


Steven.


On 12/12/2013 06:51 PM, Ken Nielson wrote:
> Steven,
>
> Github has all of the source for all versions of TORQUE. The master 
> branch is the latest work and is not considered stable.
>
> The 4.5.0 branch will be officially released in January or February. I 
> do not know when the 4.2.7 code base will be released. It will 
> probably be released in January or February as well, but I have not 
> heard anything official.
>
> The 4.5.0 code base is frozen and there won't be many changes before 
> it releases. But if you do take it now, expect to install the official 
> version later.
>
> If you have questions about github let me know.
>
> Regards
>
>
> On Tue, Dec 10, 2013 at 11:36 AM, Steven Lo <slo at cacr.caltech.edu> wrote:
>
>
>     Hi Ken,
>
>     Not sure what "master" means.  Does it include 4.1.x?
>
>     If there is no fix for 4.1.x, our only option is to upgrade to
>     4.2.x or 4.5.0.
>
>     BTW, do you know when 4.2.7 will be out officially?
>
>     Thanks.
>
>     Steven.
>
>
>
>     On 12/10/2013 10:27 AM, Ken Nielson wrote:
>>     Steven,
>>
>>     We found the problem and have fixed it. You will find the fix in
>>     Github for 4.2-dev, 4.5.0 and master.
>>
>>     The fix was made on December 6 by me (knielson). The description
>>     is "Fixed a problem where the stack size was not getting set"
>>
>>     The commit hashes are as follows:
>>
>>     master 9c584930a20cd6d99b1690b9fd82747d794e021b
>>     4.2-dev f78ae1d290dd9279af8b4f20a74e188c61ed22d3
>>     4.5.0 12c8e5b6deeb8ea4af6f46767b21eca846663a42
>>
>>     You can create a patch from these if you need.
>>
>>     Let me know if you need anything else.
>>
>>     Regards
>>
>>
>>
>>
>>     On Mon, Dec 9, 2013 at 12:50 PM, Steven Lo <slo at cacr.caltech.edu> wrote:
>>
>>
>>         We are using libc 2.5.
>>
>>         The stack size for our host is set to 1600000 KB.  If there
>>         are 2 threads, will pbs_mom then use ~3GB?
>>
>>         stack size              (kbytes, -s) 1600000
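
For reference, the arithmetic behind the ~3GB figure can be sketched in C.
This is a hypothetical helper (thread_stack_bytes is not a TORQUE function),
and it assumes that every thread reserves the full "ulimit -s" value and
that "ulimit -s" counts kilobytes of 1024 bytes.

```c
#include <stdint.h>

/* Hypothetical helper, not TORQUE code: estimates the address space
 * reserved for thread stacks when each thread inherits the full
 * "ulimit -s" value. Assumes ulimit -s is given in units of 1024 bytes. */
static uint64_t thread_stack_bytes(uint64_t ulimit_kb, unsigned nthreads)
{
    return ulimit_kb * 1024ULL * nthreads;
}
```

With 1600000 KB and 2 threads this gives 3,276,800,000 bytes, or 3125 MiB,
which is in the same range as the 3195m VIRT that top reports for pbs_mom
further down the thread.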
>>
>>
>>         Is there a workaround for this problem?  If not, is upgrading to
>>         4.2.6.1 or 4.2.7 recommended?
>>
>>         Thanks.
>>
>>         Steven.
>>
>>
>>
>>         On 12/06/2013 10:55 AM, Ken Nielson wrote:
>>>         My apologies.  I have been doing more testing and it appears
>>>         libc 2.18 still has the bug.  I will be reporting this to
>>>         GNU, and hopefully they will have a fix soon.
>>>
>>>         Again my apologies for jumping the gun.
>>>
>>>
>>>         On Fri, Dec 6, 2013 at 11:23 AM, Ken Nielson
>>>         <knielson at adaptivecomputing.com> wrote:
>>>
>>>             It is nice to have confirmation of what we found about
>>>             pthread_attr_setstacksize. As you said, the default
>>>             stack size of a new thread is the value of ulimit -s *
>>>             1000. Currently the TORQUE code tries to set a minimum
>>>             stack size but it does not check for a maximum. So if
>>>             you have your stack size set to 300,000 you will get
>>>             300,000,000 bytes allocated for each thread. On the mom
>>>             there are two threads so you get 600 MB allocated. On
>>>             the server it gets huge.
>>>
>>>             The bug in libc is that pthread_attr_setstacksize is
>>>             ignored. My development box is running libc version 2.15
>>>             and the bug is still there. However, I installed libc
>>>             2.18 on a CentOS 6 box and the bug appears to be fixed. 
>>>             I am going to modify the MOM code to also set a maximum
>>>             stack size.
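
A minimal sketch of what bounding the thread stack size on both sides could
look like. The SKETCH_MIN_STACK and SKETCH_MAX_STACK values and both function
names are made up for illustration; the limits TORQUE actually uses are not
stated in this thread, and the sketch assumes the C library honors
pthread_attr_setstacksize.

```c
#include <pthread.h>
#include <stddef.h>

/* Illustrative bounds only (assumptions, not TORQUE's real values). */
#define SKETCH_MIN_STACK ((size_t)1 << 20)   /* 1 MiB */
#define SKETCH_MAX_STACK ((size_t)8 << 20)   /* 8 MiB */

/* Clamp a requested stack size into [SKETCH_MIN_STACK, SKETCH_MAX_STACK]. */
static size_t clamp_stack_size(size_t requested)
{
    if (requested < SKETCH_MIN_STACK)
        return SKETCH_MIN_STACK;
    if (requested > SKETCH_MAX_STACK)
        return SKETCH_MAX_STACK;
    return requested;
}

/* Initialize attr and apply the clamped stack size.
 * Returns 0 on success or an errno-style value on failure. */
static int init_bounded_attr(pthread_attr_t *attr, size_t requested)
{
    int rc = pthread_attr_init(attr);
    if (rc != 0)
        return rc;
    rc = pthread_attr_setstacksize(attr, clamp_stack_size(requested));
    if (rc != 0)
        pthread_attr_destroy(attr);
    return rc;
}
```

The resulting attribute object would then be passed to pthread_create, so a
300,000,000 byte request ends up as an 8 MiB stack under these example
bounds.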
>>>
>>>             The TORQUE fix will show up in 4.2.7.
>>>
>>>             Regards
>>>
>>>
>>>             On Fri, Dec 6, 2013 at 10:52 AM, Steven Lo
>>>             <slo at cacr.caltech.edu> wrote:
>>>
>>>
>>>                 Hi David,
>>>
>>>                 The nodes which we observed are running the
>>>                 following version:
>>>
>>>                 -bash-3.2# ldd /opt/torque/sbin/pbs_mom | grep libc.so
>>>                     libc.so.6 => /lib64/libc.so.6 (0x00002b18eae2a000)
>>>
>>>                 -bash-3.2# ldd --version
>>>                 ldd (GNU libc) 2.5
>>>
>>>
>>>                 -bash-3.2# qstat --version
>>>                 Version: 4.1.5.1
>>>                 Revision:
>>>
>>>                 -bash-3.2# uname -a
>>>                 Linux zwicky005 2.6.18-308.1.1.el5 #1 SMP Fri Feb 17 16:51:01 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>>
>>>
>>>                 We see that it's using ~3G of memory:
>>>
>>>                 -bash-3.2# top -p 16695
>>>
>>>                 top - 09:46:45 up 81 days,  1:01,  1 user,  load average: 9.19, 9.17, 9.11
>>>                 Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
>>>                 Cpu(s): 74.6%us,  0.7%sy,  0.0%ni, 24.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>>>                 Mem:  24675856k total, 24286304k used,   389552k free,   497860k buffers
>>>                 Swap: 49150856k total,  4750564k used, 44400292k free, 10798448k cached
>>>
>>>                   PID USER PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
>>>                 16695 root 15   0 3195m 3.1g 7052 S  0.3 13.1 77:50.71 pbs_mom
>>>
>>>
>>>                 We came across this posting and are not sure whether it is
>>>                 relevant:
>>>
>>>                 http://comments.gmane.org/gmane.comp.clustering.torque.user/13557
>>>
>>>
>>>                 Thanks for looking into this.
>>>
>>>                 Steven.
>>>
>>>
>>>                 On 12/06/2013 09:04 AM, David Beer wrote:
>>>>                 The issue is that in some versions of libc, the
>>>>                 pthread stack size will default to 1000 * <the
>>>>                 value set in ulimit -s>, even though TORQUE
>>>>                 specifies what stack size each thread should have.
>>>>                 I will work to get a list of the versions of libc
>>>>                 that have this bug. Ken is the one that discovered
>>>>                 this defect, so I'll ask him for the info or ask
>>>>                 him to post the info.
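
One way to see what a given host does is to compare the soft RLIMIT_STACK
limit with the stack size recorded in a freshly initialized pthread
attribute object. The helper names below are hypothetical, and the sketch
only reports values; it does not by itself detect the defect described
above.

```c
#include <pthread.h>
#include <stddef.h>
#include <sys/resource.h>

/* Store the soft RLIMIT_STACK limit in bytes (the value "ulimit -s"
 * reports in kilobytes). Returns 0 on success, -1 on failure. */
static int soft_stack_limit(rlim_t *out)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return -1;
    *out = rl.rlim_cur;
    return 0;
}

/* Store the stack size held by a freshly initialized attribute object,
 * i.e. what a thread gets when the program sets no size explicitly.
 * Returns 0 on success or an errno-style value on failure. */
static int default_attr_stack_size(size_t *out)
{
    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_attr_getstacksize(&attr, out);
    pthread_attr_destroy(&attr);
    return rc;
}
```

Printing both values on a node, under the same limits pbs_mom is started
with, should show whether the default thread stack tracks the ulimit
setting there.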
>>>>
>>>>
>>>>                 On Fri, Dec 6, 2013 at 9:02 AM, Gus Correa
>>>>                 <gus at ldeo.columbia.edu> wrote:
>>>>
>>>>                     David
>>>>
>>>>                     For the benefit of all Torque users,
>>>>                     could you please disclose all combinations of
>>>>                     libc versions and Torque versions that have
>>>>                     this problem?
>>>>
>>>>                     Thank you,
>>>>                     Gus Correa
>>>>
>>>>                     On 12/05/2013 08:52 PM, David Beer wrote:
>>>>                     > Steven,
>>>>                     >
>>>>                     > What OS and version of the pthread library (libc)
>>>>                     > do you have? We know of a rather large memory leak
>>>>                     > related to different versions of these libraries.
>>>>                     >
>>>>                     >
>>>>                     > On Thu, Dec 5, 2013 at 12:01 PM, Steven Lo
>>>>                     > <slo at cacr.caltech.edu> wrote:
>>>>                     >
>>>>                     >
>>>>                     >     Hi,
>>>>                     >
>>>>                     >     We've discovered that pbs_mom on most nodes is
>>>>                     >     using over 3GB of memory.
>>>>                     >     Is there a known memory leak issue for version
>>>>                     >     4.1.5.1?  If so, is there a patch for it, or do
>>>>                     >     we have to upgrade to another version such as
>>>>                     >     4.1.7 or 4.2.6.1?
>>>>                     >
>>>>                     >     Thanks in advance for your suggestion.
>>>>                     >
>>>>                     >     Steven.
>>>>                     >
>>>>                     > _______________________________________________
>>>>                     > torqueusers mailing list
>>>>                     > torqueusers at supercluster.org
>>>>                     > http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>                     >
>>>>                     >
>>>>                     >
>>>>                     >
>>>>                     > --
>>>>                     > David Beer | Senior Software Engineer
>>>>                     > Adaptive Computing
>>>>                     >
>>>>                     >
>>>>
>>>
>>>
>>>
>>>             -- 
>>>             Ken Nielson
>>>             +1 801.717.3700 office +1 801.717.3738 fax
>>>             1712 S. East Bay Blvd, Suite 300  Provo, UT 84606
>>>             www.adaptivecomputing.com
>>>
>>>
>>>
>>>
>>
>>
>
>
