[torqueusers] 4.1.5.1 memory leak

David Beer dbeer at adaptivecomputing.com
Wed Dec 11 10:25:06 MST 2013


I cherry-picked the commit into 4.1-dev. I only had to help it along a
little.

8b0db74


On Tue, Dec 10, 2013 at 2:02 PM, Steven Lo <slo at cacr.caltech.edu> wrote:

>
> Hi David,
>
> Thanks for the info.  Very helpful.
>
> It would be nice if we could get a fix for the 4.1 branch, since it would
> provide a quick fix for our cluster (some jobs are failing due to lack of
> memory).  That would also give us a little breathing room for the upgrade.
> I presume the fix will affect the pbs_mom binary only?
>
> Again, thanks very much for your help.
>
> Steven.
>
>
>
> On 12/10/2013 12:42 PM, David Beer wrote:
>
> Oh, and to answer your other question: it hasn't been checked in to the
> 4.1-dev branch. If you need that, let us know and we can probably port it.
>
>
>  On Tue, Dec 10, 2013 at 1:41 PM, David Beer <dbeer at adaptivecomputing.com> wrote:
>
>> Steven,
>>
>>  4.2.7 is due out in mid-January. To clarify, "master" in git is the
>> equivalent of trunk in subversion or mercurial. When new major releases are
>> made, such as 4.5.0, they are branched from master.
>>
>>
>> On Tue, Dec 10, 2013 at 11:36 AM, Steven Lo <slo at cacr.caltech.edu> wrote:
>>
>>>
>>> Hi Ken,
>>>
>>> Not sure what "master" means.  Does it include 4.1.x?
>>>
>>> If there is no fix for 4.1.x, our only option is to upgrade to 4.2.x or
>>> 4.5.0.
>>>
>>> BTW, do you know when 4.2.7 will be out officially?
>>>
>>> Thanks.
>>>
>>> Steven.
>>>
>>>
>>>
>>> On 12/10/2013 10:27 AM, Ken Nielson wrote:
>>>
>>>    Steven,
>>>
>>>  We found the problem and have fixed it. You will find the fix on GitHub
>>> for 4.2-dev, 4.5.0 and master.
>>>
>>>  The fix was made on December 6 by me (knielson). The description is
>>> "Fixed a problem where the stack size was not getting set"
>>>
>>>  The commit hashes are as follows:
>>>
>>>  master     9c584930a20cd6d99b1690b9fd82747d794e021b
>>>  4.2-dev    f78ae1d290dd9279af8b4f20a74e188c61ed22d3
>>>  4.5.0      12c8e5b6deeb8ea4af6f46767b21eca846663a42
>>>
>>>  You can create a patch from these if you need to.
>>>
>>> Let me know if you need anything else.
>>>
>>> Regards
>>>
>>>
>>>
>>>
>>> On Mon, Dec 9, 2013 at 12:50 PM, Steven Lo <slo at cacr.caltech.edu> wrote:
>>>
>>>>
>>>> We are using libc 2.5.
>>>>
>>>> The stack size for our host is set to 1600000 KB.  If there are 2
>>>> threads, will pbs_mom then use ~3GB?
>>>>
>>>> stack size              (kbytes, -s) 1600000
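
The arithmetic behind this question can be checked with a short sketch. It assumes that every thread reserves the full `ulimit -s` value as its stack and that the value is in 1024-byte units (the thread rounds this to 1000). The function name and its parameters are illustrative only and do not come from TORQUE.

```python
# Rough estimate of the memory reserved for thread stacks when each thread
# inherits the `ulimit -s` value as its stack size.
# Assumption: `ulimit -s` is reported in 1024-byte units; the helper name
# and parameters below are hypothetical and used for illustration only.

KIB = 1024
MIB = 1024 * 1024


def estimated_stack_bytes(ulimit_s_kib: int, thread_count: int) -> int:
    """Return the total bytes reserved if every thread gets the full limit."""
    return ulimit_s_kib * KIB * thread_count


if __name__ == "__main__":
    # Values from this thread: ulimit -s of 1600000 and two pbs_mom threads.
    total = estimated_stack_bytes(1_600_000, 2)
    print(f"{total} bytes = {total / MIB:.0f} MiB")
```

With the values from this thread the estimate is 3276800000 bytes, or 3125 MiB, which is close to the 3195m VIRT figure that top reports for pbs_mom in the earlier message quoted below.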
>>>>
>>>>
>>>> Is there a workaround for this problem?  If not, is an upgrade to 4.2.6.1
>>>> or 4.2.7 recommended?
>>>>
>>>> Thanks.
>>>>
>>>> Steven.
>>>>
>>>>
>>>>
>>>> On 12/06/2013 10:55 AM, Ken Nielson wrote:
>>>>
>>>>  My apologies.  I have been doing more testing and it appears libc
>>>> 2.18 still has the bug.  I will be reporting this to GNU and hopefully
>>>> they will have a fix soon.
>>>>
>>>>  Again my apologies for jumping the gun.
>>>>
>>>>
>>>> On Fri, Dec 6, 2013 at 11:23 AM, Ken Nielson <knielson at adaptivecomputing.com> wrote:
>>>>
>>>>>   It is nice to have confirmation of what we found about
>>>>> pthread_attr_setstacksize. As you said, the default stack size of a new
>>>>> thread is the value of ulimit -s * 1000. Currently the TORQUE code tries to
>>>>> enforce a minimum stack size, but it does not check for a maximum. So if you
>>>>> have your stack size set to 300,000 you will get 300,000,000 bytes
>>>>> allocated for each thread. On the mom there are two threads, so you get 600
>>>>> MB allocated. On the server it gets huge.
>>>>>
>>>>> The bug in libc is that the pthread_attr_setstacksize call is ignored. My
>>>>> development box is running libc version 2.15 and the bug is still there.
>>>>> However, I installed libc 2.18 on a CentOS 6 box and the bug appears to be
>>>>> fixed.  I am going to modify the MOM code to also set a maximum stack size.
>>>>>
>>>>>  The TORQUE fix will show up in 4.2.7.
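
The idea of bounding the stack size on both sides can be sketched as follows. This is an illustration in Python, not the TORQUE change itself, which works through pthread_attr_setstacksize. Here `threading.stack_size` plays the analogous role. The names `clamp_stack_size` and `run_with_clamped_stack` and both bounds are hypothetical values chosen for the example.

```python
# Sketch of clamping a thread stack size into a [minimum, maximum] range
# before creating threads. The bounds below are illustrative assumptions,
# not the values used in the TORQUE source.
import threading

MIN_STACK_BYTES = 1 * 1024 * 1024   # hypothetical lower bound (1 MiB)
MAX_STACK_BYTES = 8 * 1024 * 1024   # hypothetical upper bound (8 MiB)


def clamp_stack_size(requested_bytes: int,
                     minimum: int = MIN_STACK_BYTES,
                     maximum: int = MAX_STACK_BYTES) -> int:
    """Raise values below the minimum and cap values above the maximum."""
    return max(minimum, min(requested_bytes, maximum))


def run_with_clamped_stack(requested_bytes: int) -> int:
    """Start one worker thread with a clamped stack size; return that size."""
    size = clamp_stack_size(requested_bytes)
    previous = threading.stack_size(size)  # applies to threads created later
    try:
        worker = threading.Thread(target=lambda: None)
        worker.start()
        worker.join()
    finally:
        threading.stack_size(previous)  # restore the earlier setting
    return size
```

With these bounds, a request of 300,000,000 bytes (the per-thread figure from the example above) is capped at 8 MiB, so the thread count no longer multiplies an oversized limit.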
>>>>>
>>>>>  Regards
>>>>>
>>>>>
>>>>> On Fri, Dec 6, 2013 at 10:52 AM, Steven Lo <slo at cacr.caltech.edu> wrote:
>>>>>
>>>>>>
>>>>>> Hi David,
>>>>>>
>>>>>> The nodes which we observed are running the following version:
>>>>>>
>>>>>> -bash-3.2# ldd /opt/torque/sbin/pbs_mom | grep libc.so
>>>>>>     libc.so.6 => /lib64/libc.so.6 (0x00002b18eae2a000)
>>>>>>
>>>>>> -bash-3.2# ldd --version
>>>>>> ldd (GNU libc) 2.5
>>>>>>
>>>>>>
>>>>>> -bash-3.2# qstat --version
>>>>>> Version: 4.1.5.1
>>>>>> Revision:
>>>>>>
>>>>>> -bash-3.2# uname -a
>>>>>> Linux zwicky005 2.6.18-308.1.1.el5 #1 SMP Fri Feb 17 16:51:01 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
>>>>>>
>>>>>>
>>>>>>
>>>>>> We see that it's using ~3G of memory:
>>>>>>
>>>>>> -bash-3.2# top -p 16695
>>>>>>
>>>>>> top - 09:46:45 up 81 days,  1:01,  1 user,  load average: 9.19, 9.17, 9.11
>>>>>> Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
>>>>>> Cpu(s): 74.6%us,  0.7%sy,  0.0%ni, 24.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>>>>>> Mem:  24675856k total, 24286304k used,   389552k free,   497860k buffers
>>>>>> Swap: 49150856k total,  4750564k used, 44400292k free, 10798448k cached
>>>>>>
>>>>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>>>> 16695 root      15   0 3195m 3.1g 7052 S  0.3 13.1  77:50.71 pbs_mom
>>>>>>
>>>>>>
>>>>>> We came across this posting and are not sure if it is relevant:
>>>>>>
>>>>>> http://comments.gmane.org/gmane.comp.clustering.torque.user/13557
>>>>>>
>>>>>>
>>>>>> Thanks for looking into this.
>>>>>>
>>>>>> Steven.
>>>>>>
>>>>>>
>>>>>> On 12/06/2013 09:04 AM, David Beer wrote:
>>>>>>
>>>>>> The issue is that in some versions of libc, the pthread stack size
>>>>>> will default to 1000 * <the value set in ulimit -s>, even though TORQUE
>>>>>> specifies what stack size each thread should have. I will work to get a
>>>>>> list of the versions of libc that have this bug. Ken is the one who
>>>>>> discovered this defect, so I'll ask him for the info or ask him to post it.
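
To see which limit a process would inherit, the soft stack limit can be read directly. The sketch below assumes a Unix-like system where Python's standard `resource` module is available. The helper name `soft_limit_to_kib` is hypothetical.

```python
# Read the soft stack limit of the current process, i.e. the value that
# `ulimit -s` reports, and convert it from bytes to KiB.
# Assumption: a Unix-like system where the standard `resource` module exists.
import resource


def soft_limit_to_kib(soft_limit_bytes: int):
    """Convert a soft RLIMIT_STACK value in bytes to KiB; None if unlimited."""
    if soft_limit_bytes == resource.RLIM_INFINITY:
        return None
    return soft_limit_bytes // 1024


if __name__ == "__main__":
    soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
    kib = soft_limit_to_kib(soft)
    print("stack limit: unlimited" if kib is None else f"stack limit: {kib} KiB")
```

Running this in the environment that starts pbs_mom should show the same number as `ulimit -s` in that shell, which is the value the affected libc versions are described as applying to every new thread.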
>>>>>>
>>>>>>
>>>>>> On Fri, Dec 6, 2013 at 9:02 AM, Gus Correa <gus at ldeo.columbia.edu> wrote:
>>>>>>
>>>>>>> David
>>>>>>>
>>>>>>> For the benefit of all Torque users,
>>>>>>> could you please disclose all combinations of libc versions
>>>>>>> and Torque versions that have this problem?
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Gus Correa
>>>>>>>
>>>>>>> On 12/05/2013 08:52 PM, David Beer wrote:
>>>>>>> > Steven,
>>>>>>> >
>>>>>>> > What OS and version of the pthread library (libc) do you have? We know
>>>>>>> > of a rather large memory leak related to different versions of these
>>>>>>> > libraries.
>>>>>>> >
>>>>>>> >
>>>>>>> > On Thu, Dec 5, 2013 at 12:01 PM, Steven Lo <slo at cacr.caltech.edu> wrote:
>>>>>>> >
>>>>>>> >
>>>>>>> >     Hi,
>>>>>>> >
>>>>>>> >     We've discovered that pbs_mom on most nodes is using over 3GB of
>>>>>>> >     memory.
>>>>>>> >     Is there a known memory leak issue for version 4.1.5.1?  If so, is
>>>>>>> >     there a patch for it, or do we have to upgrade to another version
>>>>>>> >     like 4.1.7 or 4.2.6.1?
>>>>>>> >
>>>>>>> >     Thanks in advance for your suggestion.
>>>>>>> >
>>>>>>> >     Steven.
>>>>>>> >
>>>>>>> >     _______________________________________________
>>>>>>> >     torqueusers mailing list
>>>>>>> >     torqueusers at supercluster.org
>>>>>>> >     http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > --
>>>>>>> > David Beer | Senior Software Engineer
>>>>>>> > Adaptive Computing
>>>>>>> >
>>>>>>> >
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>  Ken Nielson
>>>>> +1 801.717.3700 office  +1 801.717.3738 fax
>>>>> 1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
>>>>> www.adaptivecomputing.com
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>
>
>


-- 
David Beer | Senior Software Engineer
Adaptive Computing