[torqueusers] 4.1.5.1 memory leak

David Beer dbeer at adaptivecomputing.com
Tue Dec 10 13:42:50 MST 2013


Oh, and to answer your other question: the fix hasn't been checked in to the
4.1-dev branch. If you need that, let us know and we can probably port it.


On Tue, Dec 10, 2013 at 1:41 PM, David Beer <dbeer at adaptivecomputing.com> wrote:

> Steven,
>
> 4.2.7 is due out in mid-January. To clarify, "master" in git is the
> equivalent of trunk in subversion or mercurial. When new major releases are
> made, such as 4.5.0, they are branched from master.
>
>
> On Tue, Dec 10, 2013 at 11:36 AM, Steven Lo <slo at cacr.caltech.edu> wrote:
>
>>
>> Hi Ken,
>>
>> Not sure what "master" means.  Does it include 4.1.x?
>>
>> If there is no fix for 4.1.x, our only option is to upgrade to 4.2.x or
>> 4.5.0.
>>
>> BTW, do you know when 4.2.7 will be out officially?
>>
>> Thanks.
>>
>> Steven.
>>
>>
>>
>> On 12/10/2013 10:27 AM, Ken Nielson wrote:
>>
>>    Steven,
>>
>>  We found the problem and have fixed it. You will find the fix on GitHub
>> for 4.2-dev, 4.5.0 and master.
>>
>>  The fix was made on December 6 by me (knielson). The description is
>> "Fixed a problem where the stack size was not getting set"
>>
>>  The commit hashes are as follows:
>>
>>  master     9c584930a20cd6d99b1690b9fd82747d794e021b
>>  4.2-dev    f78ae1d290dd9279af8b4f20a74e188c61ed22d3
>>  4.5.0      12c8e5b6deeb8ea4af6f46767b21eca846663a42
>>
>>  You can create a patch from these if you need.
>>
>> Let me know if you need anything else.
>>
>> Regards
>>
>>
>>
>>
>> On Mon, Dec 9, 2013 at 12:50 PM, Steven Lo <slo at cacr.caltech.edu> wrote:
>>
>>>
>>> We are using libc 2.5.
>>>
>>> The stack size for our host is set to 1600000 KB.  If there are 2
>>> threads, will pbs_mom then use ~3GB?
>>>
>>> stack size              (kbytes, -s) 1600000
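
As a rough check of the ~3GB estimate above, here is a minimal sketch in C.
It assumes that each thread reserves the full "ulimit -s" value as its stack,
that the shell reports that value in units of 1024 bytes (as bash does), and
that pbs_mom runs two threads, as stated elsewhere in this thread. The
function name estimated_stack_bytes is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes reserved for thread stacks if every thread
 * gets the full "ulimit -s" value (reported in 1024-byte units). */
uint64_t estimated_stack_bytes(uint64_t ulimit_s_kbytes, uint64_t threads)
{
    /* Guard against overflow of the 64-bit product. */
    assert(threads == 0 || ulimit_s_kbytes <= UINT64_MAX / 1024u / threads);
    return ulimit_s_kbytes * 1024u * threads;
}
```

For the values above, estimated_stack_bytes(1600000, 2) gives 3,276,800,000
bytes, or 3125 MiB, which is in line with the 3195m VIRT figure that top
reported for pbs_mom.
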
>>>
>>>
>>> Is there a workaround for this problem?  If not, is upgrading to 4.2.6.1
>>> or 4.2.7 recommended?
>>>
>>> Thanks.
>>>
>>> Steven.
>>>
>>>
>>>
>>> On 12/06/2013 10:55 AM, Ken Nielson wrote:
>>>
>>>  My apologies.  I have been doing more testing and it appears libc 2.18
>>> still has the bug.  I will be reporting this to GNU and hopefully they
>>> will have a fix soon.
>>>
>>>  Again my apologies for jumping the gun.
>>>
>>>
>>> On Fri, Dec 6, 2013 at 11:23 AM, Ken Nielson <knielson at adaptivecomputing.com> wrote:
>>>
>>>>   It is nice to have confirmation of what we found about
>>>> pthread_attr_setstacksize. As you said, the default stack size of a new
>>>> thread is the value of ulimit -s * 1000. Currently the TORQUE code tries to
>>>> enforce a minimum stack size but it does not check for a maximum. So if you
>>>> have your stack size set to 300,000 you will get 300,000,000 bytes
>>>> allocated for each thread. On the mom there are two threads so you get 600
>>>> MB allocated. On the server it gets huge.
>>>>
>>>> The bug in libc is that the pthread_attr_setstacksize call is ignored. My
>>>> development box is running libc version 2.15 and the bug is still there.
>>>> However, I installed libc 2.18 on a CentOS 6 box and the bug appears to be
>>>> fixed.  I am going to modify the MOM code to also set a maximum stack size.
>>>>
>>>>  The TORQUE fix will show up in 4.2.7.
>>>>
>>>>  Regards
>>>>
>>>>
>>>> On Fri, Dec 6, 2013 at 10:52 AM, Steven Lo <slo at cacr.caltech.edu> wrote:
>>>>
>>>>>
>>>>> Hi David,
>>>>>
>>>>> The nodes which we observed are running the following version:
>>>>>
>>>>> -bash-3.2# ldd /opt/torque/sbin/pbs_mom | grep libc.so
>>>>>     libc.so.6 => /lib64/libc.so.6 (0x00002b18eae2a000)
>>>>>
>>>>> -bash-3.2# ldd --version
>>>>> ldd (GNU libc) 2.5
>>>>>
>>>>>
>>>>> -bash-3.2# qstat --version
>>>>> Version: 4.1.5.1
>>>>> Revision:
>>>>>
>>>>> -bash-3.2# uname -a
>>>>> Linux zwicky005 2.6.18-308.1.1.el5 #1 SMP Fri Feb 17 16:51:01 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
>>>>>
>>>>>
>>>>>
>>>>> We see that it's using ~3G of memory:
>>>>>
>>>>> -bash-3.2# top -p 16695
>>>>>
>>>>> top - 09:46:45 up 81 days,  1:01,  1 user,  load average: 9.19, 9.17, 9.11
>>>>> Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
>>>>> Cpu(s): 74.6%us,  0.7%sy,  0.0%ni, 24.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>>>>> Mem:  24675856k total, 24286304k used,   389552k free,   497860k buffers
>>>>> Swap: 49150856k total,  4750564k used, 44400292k free, 10798448k cached
>>>>>
>>>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>>> 16695 root      15   0 3195m 3.1g 7052 S  0.3 13.1  77:50.71 pbs_mom
>>>>>
>>>>>
>>>>> We came across this posting and are not sure whether it is relevant:
>>>>>
>>>>> http://comments.gmane.org/gmane.comp.clustering.torque.user/13557
>>>>>
>>>>>
>>>>> Thanks for looking into this.
>>>>>
>>>>> Steven.
>>>>>
>>>>>
>>>>> On 12/06/2013 09:04 AM, David Beer wrote:
>>>>>
>>>>> The issue is that in some versions of libc, the pthread stack size
>>>>> will default to 1000 * <the value set in ulimit -s>, even though TORQUE
>>>>> specifies what stack size each thread should have. I will work to get a
>>>>> list of the versions of libc that have this bug. Ken is the one that
>>>>> discovered this defect, so I'll ask him for the info or ask him to post the
>>>>> info.
>>>>>
>>>>>
>>>>> On Fri, Dec 6, 2013 at 9:02 AM, Gus Correa <gus at ldeo.columbia.edu> wrote:
>>>>>
>>>>>> David
>>>>>>
>>>>>> For the benefit of all Torque users,
>>>>>> could you please disclose all combinations of libc versions
>>>>>> and Torque versions that have this problem?
>>>>>>
>>>>>> Thank you,
>>>>>> Gus Correa
>>>>>>
>>>>>> On 12/05/2013 08:52 PM, David Beer wrote:
>>>>>> > Steven,
>>>>>> >
>>>>>> > What OS and version of the pthread library (libc) do you have? We
>>>>>> > know of a rather large memory leak related to different versions of
>>>>>> > these libraries.
>>>>>> >
>>>>>> >
>>>>>> > On Thu, Dec 5, 2013 at 12:01 PM, Steven Lo <slo at cacr.caltech.edu> wrote:
>>>>>> >
>>>>>> >
>>>>>> >     Hi,
>>>>>> >
>>>>>> >     We've discovered that pbs_mom on most nodes is using over 3GB
>>>>>> >     of memory.
>>>>>> >     Is there a known memory leak issue for version 4.1.5.1?  If so,
>>>>>> >     is there a patch for it, or do we have to upgrade to another
>>>>>> >     version such as 4.1.7 or 4.2.6.1?
>>>>>> >
>>>>>> >     Thanks in advance for your suggestion.
>>>>>> >
>>>>>> >     Steven.
>>>>>> >
>>>>>> >     _______________________________________________
>>>>>> >     torqueusers mailing list
>>>>>> >     torqueusers at supercluster.org
>>>>>> >     http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > --
>>>>>> > David Beer | Senior Software Engineer
>>>>>> > Adaptive Computing
>>>>>> >
>>>>>> >
>>>>>
>>>>>
>>>>>
>>>>>  --
>>>>> David Beer | Senior Software Engineer
>>>>> Adaptive Computing
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>  Ken Nielson
>>>> +1 801.717.3700 office +1 801.717.3738 fax
>>>> 1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
>>>> www.adaptivecomputing.com
>>>>
>>>>
>>>
>>>
>>> --
>>> Ken Nielson
>>> +1 801.717.3700 office +1 801.717.3738 fax
>>> 1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
>>> www.adaptivecomputing.com
>>>
>>>
>>>
>>
>>
>> --
>> Ken Nielson
>> +1 801.717.3700 office +1 801.717.3738 fax
>> 1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
>> www.adaptivecomputing.com
>>
>>
>>
>
>
> --
> David Beer | Senior Software Engineer
> Adaptive Computing
>



-- 
David Beer | Senior Software Engineer
Adaptive Computing