[torqueusers] 4.1.5.1 memory leak

Ken Nielson knielson at adaptivecomputing.com
Tue Dec 10 11:27:49 MST 2013


Steven,

We found the problem and have fixed it. You will find the fix on GitHub in
the 4.2-dev, 4.5.0 and master branches.

The fix was made on December 6 by me (knielson). The commit description is
"Fixed a problem where the stack size was not getting set".

The commit hashes are as follows:

master     9c584930a20cd6d99b1690b9fd82747d794e021b
4.2-dev    f78ae1d290dd9279af8b4f20a74e188c61ed22d3
4.5.0      12c8e5b6deeb8ea4af6f46767b21eca846663a42

You can create a patch from these if you need one.
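
For anyone who wants the idea without reading the commits: the approach discussed earlier in this thread is to bound the thread stack size with both a minimum and a maximum instead of inheriting whatever ulimit -s happens to be. The sketch below shows that general technique in Python, purely for illustration. It is not the TORQUE code (which is C), and MIN_STACK, MAX_STACK, PAGE and the function names are assumptions made up for this example.

```python
# Illustrative only: bound the stack size for new threads instead of
# inheriting the process stack limit. MIN_STACK and MAX_STACK are
# made-up bounds, not values taken from TORQUE.
import resource
import threading

PAGE = 4096                    # assumed page size, used only for rounding
MIN_STACK = 1 * 1024 * 1024    # hypothetical lower bound: 1 MiB
MAX_STACK = 8 * 1024 * 1024    # hypothetical upper bound: 8 MiB


def clamp_stack_size(requested: int) -> int:
    """Limit `requested` to [MIN_STACK, MAX_STACK], rounded down to a page."""
    bounded = max(MIN_STACK, min(requested, MAX_STACK))
    return bounded - bounded % PAGE


def stack_size_from_rlimit() -> int:
    """Derive a bounded stack size from the soft RLIMIT_STACK value."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
    if soft == resource.RLIM_INFINITY:
        # "unlimited" gives no usable number, so fall back to the cap.
        return MAX_STACK
    return clamp_stack_size(soft)


if __name__ == "__main__":
    size = stack_size_from_rlimit()
    # threading.stack_size() applies to threads created after this call.
    threading.stack_size(size)
    worker = threading.Thread(target=lambda: print("worker stack:", size))
    worker.start()
    worker.join()
```

With bounds like these, a stack limit of 300,000,000 bytes would be cut down to the 8 MiB cap rather than being reserved in full for every thread.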

Let me know if you need anything else.

Regards




On Mon, Dec 9, 2013 at 12:50 PM, Steven Lo <slo at cacr.caltech.edu> wrote:

>
> We are using libc 2.5.
>
> The stack size for our host is set to 1600000 KB.  If there are 2 threads,
> will pbs_mom then use ~3GB?
>
> stack size              (kbytes, -s) 1600000
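
A quick back-of-the-envelope check of that estimate, as a sketch rather than a measurement. It assumes each thread's stack defaults to the full ulimit -s value, that the shell reports that value in 1024-byte units, and that the MOM has two threads as stated elsewhere in this thread. The function name is only illustrative.

```python
# Rough estimate of the stack memory reserved by pbs_mom threads.
# Assumptions (not taken from the TORQUE source): every thread gets the
# full `ulimit -s` value as its stack, and that value is in 1024-byte units.

KIB = 1024
MIB = 1024 * 1024


def estimated_stack_bytes(ulimit_s_kib: int, threads: int) -> int:
    """Total bytes reserved for thread stacks under the assumptions above."""
    return ulimit_s_kib * KIB * threads


if __name__ == "__main__":
    total = estimated_stack_bytes(1_600_000, threads=2)
    # prints: 3276800000 bytes = 3125 MiB
    print(f"{total} bytes = {total / MIB:.0f} MiB")
```

That is close to the 3195m VIRT figure in the top output quoted further down, which would be consistent with the two thread stacks accounting for most of the footprint.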
>
>
> Is there a workaround for this problem?  If not, is an upgrade to 4.2.6.1
> or 4.2.7 recommended?
>
> Thanks.
>
> Steven.
>
>
>
> On 12/06/2013 10:55 AM, Ken Nielson wrote:
>
>  My apologies.  I have been doing more testing and it appears libc 2.18
> still has the bug.  I will be reporting this to GNU and hopefully they
> will have a fix soon.
>
>  Again my apologies for jumping the gun.
>
>
> On Fri, Dec 6, 2013 at 11:23 AM, Ken Nielson <
> knielson at adaptivecomputing.com> wrote:
>
>>   It is nice to have confirmation of what we found about
>> pthread_attr_setstacksize. As you said, the default stack size of a new
>> thread is the value of ulimit -s * 1000. Currently the TORQUE code enforces
>> a minimum stack size but does not check for a maximum. So if you
>> have your stack size set to 300,000 you will get 300,000,000 bytes
>> allocated for each thread. On the mom there are two threads so you get 600
>> MB allocated. On the server it gets huge.
>>
>> The bug in libc is that the pthread_attr_setstacksize call is ignored. My
>> development box is running libc version 2.15 and the bug is still there.
>> However, I installed libc 2.18 on a CentOS 6 box and the bug appears to be
>> fixed.  I am going to modify the MOM code to also set a maximum stack size.
>>
>>  The TORQUE fix will show up in 4.2.7.
>>
>>  Regards
>>
>>
>> On Fri, Dec 6, 2013 at 10:52 AM, Steven Lo <slo at cacr.caltech.edu> wrote:
>>
>>>
>>> Hi David,
>>>
>>> The nodes which we observed are running the following version:
>>>
>>> -bash-3.2# ldd /opt/torque/sbin/pbs_mom | grep libc.so
>>>     libc.so.6 => /lib64/libc.so.6 (0x00002b18eae2a000)
>>>
>>> -bash-3.2# ldd --version
>>> ldd (GNU libc) 2.5
>>>
>>>
>>> -bash-3.2# qstat --version
>>> Version: 4.1.5.1
>>> Revision:
>>>
>>> -bash-3.2# uname -a
>>> Linux zwicky005 2.6.18-308.1.1.el5 #1 SMP Fri Feb 17 16:51:01 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>>
>>>
>>> We see that it's using ~3G of memory:
>>>
>>> -bash-3.2# top -p 16695
>>>
>>> top - 09:46:45 up 81 days,  1:01,  1 user,  load average: 9.19, 9.17, 9.11
>>> Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
>>> Cpu(s): 74.6%us,  0.7%sy,  0.0%ni, 24.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>>> Mem:  24675856k total, 24286304k used,   389552k free,   497860k buffers
>>> Swap: 49150856k total,  4750564k used, 44400292k free, 10798448k cached
>>>
>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>> 16695 root      15   0 3195m 3.1g 7052 S  0.3 13.1  77:50.71 pbs_mom
>>>
>>>
>>> We came across this posting and are not sure whether it is relevant:
>>>
>>> http://comments.gmane.org/gmane.comp.clustering.torque.user/13557
>>>
>>>
>>> Thanks for looking into this.
>>>
>>> Steven.
>>>
>>>
>>> On 12/06/2013 09:04 AM, David Beer wrote:
>>>
>>> The issue is that in some versions of libc, the pthread stack size will
>>> default to 1000 * <the value set in ulimit -s>, even though TORQUE
>>> specifies what stack size each thread should have. I will work to get a
>>> list of the versions of libc that have this bug. Ken is the one that
>>> discovered this defect, so I'll ask him for the info or ask him to post the
>>> info.
>>>
>>>
>>> On Fri, Dec 6, 2013 at 9:02 AM, Gus Correa <gus at ldeo.columbia.edu> wrote:
>>>
>>>> David
>>>>
>>>> For the benefit of all Torque users,
>>>> could you please disclose all combinations of libc versions
>>>> and Torque versions that have this problem?
>>>>
>>>> Thank you,
>>>> Gus Correa
>>>>
>>>> On 12/05/2013 08:52 PM, David Beer wrote:
>>>> > Steven,
>>>> >
>>>> > What OS and version of the pthread library (libc) do you have? We know
>>>> > of a rather large memory leak related to different versions of these
>>>> > libraries.
>>>> >
>>>> >
>>>> > On Thu, Dec 5, 2013 at 12:01 PM, Steven Lo <slo at cacr.caltech.edu> wrote:
>>>> >
>>>> >
>>>> >     Hi,
>>>> >
>>>> >     We've discovered that pbs_mom on most nodes is using over 3GB of
>>>> >     memory.
>>>> >     Is there a known memory leak issue for version 4.1.5.1?  If so,
>>>> >     is there a patch for it, or do we have to upgrade to another
>>>> >     version like 4.1.7 or 4.2.6.1?
>>>> >
>>>> >     Thanks in advance for your suggestion.
>>>> >
>>>> >     Steven.
>>>> >
>>>> >     _______________________________________________
>>>> >     torqueusers mailing list
>>>> >     torqueusers at supercluster.org
>>>> >     http://www.supercluster.org/mailman/listinfo/torqueusers
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > David Beer | Senior Software Engineer
>>>> > Adaptive Computing
>>>> >
>>>> >
>>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>>  Ken Nielson
>> +1 801.717.3700 office +1 801.717.3738 fax
>> 1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
>> www.adaptivecomputing.com
>>
>>
>
>
>
>


-- 
Ken Nielson
+1 801.717.3700 office +1 801.717.3738 fax
1712 S. East Bay Blvd, Suite 300  Provo, UT  84606
www.adaptivecomputing.com