[torqueusers] pbs_mom eating memory like a maniac

Dave Jackson jacksond at clusterresources.com
Thu Nov 10 14:30:51 MST 2005


Garrick,

  You are awesome!  Let me know and we'll roll it in and push out a
build.

Dave

On Thu, 2005-11-10 at 13:21 -0800, Garrick Staples wrote:
> It's in check_pwd().  I'll have a patch in a few minutes.
> 
> 
> On Thu, Nov 10, 2005 at 11:52:31AM -0700, Dave Jackson alleged:
> > Martin,
> > 
> >   The memory leak is definitely not expected behavior.  Valgrind can
> > probably isolate the issue quickly.  So to confirm, you run a single job
> > with repeated mpiexec calls inside and get the memory leak.  Correct?
> > 
> >   Get us the valgrind output as soon as possible and we will see what we
> > can do.
> > 
> > Dave
> > 
> > > On Thu, 2005-11-10 at 19:11 +0100, Martin Schafföner wrote:
> > > It may be I am missing something or that I configured something horribly 
> > > wrong, but my pbs_moms are slowly eating all available virtual memory.
> > > 
> > > But let's start slowly. I replaced my old Torque 1.1.0p4 installation with 
> > > 2.0.0p0 (and now with 2.0.0p1, but no change). While doing that, I also 
> > > replaced all rsh access to nodes by mpiexec which uses PBS' TM API. So far, 
> > > so good. But now the pbs_moms are slowly but surely eating RAM.
> > > 
> > > I configured Torque like this:
> > > 
> > > ./configure --enable-clients \
> > >     --prefix=%{_pbs_home} \
> > >     --set-server-home=/var/spool/torque \
> > >     --set-server-name-file=/var/spool/torque/default_server \
> > >     --set-default-server=ko-cluster.et.uni-magdeburg.de \
> > >     --enable-docs \
> > >     --mandir=%{_pbs_home}/man \
> > >     --enable-tcl-qstat \
> > >     --with-tcl \
> > >     --enable-mom \
> > >     --enable-server \
> > >     --enable-syslog \
> > >     --enable-gui \
> > >     --libdir=%{_pbs_home}/%{_lib} \
> > >     --x-libraries=/usr/X11R6/%{_lib}
> > > 
> > > in an RPM build (just for convenience of installing Torque on the nodes) and 
> > > uncommented the NO_SPOOL_OUTPUT line in server_limits.h. I then built the 
> > > stuff and installed it on the master and on the nodes.
> > > 
> > > After a while I noticed that jobs could no longer be executed because fork() 
> > > failed due to exhausted memory, although physically lots of RAM was 
> > > available. I checked processes' memory usage and found pbs_mom having about 
> > > 47MB RSS and 1.5GB (!!!) virtual memory. With the nodes set to not allow 
> > > overcommitment of memory, fork surely wouldn't work because 2 x 1.5GB was far 
> > > more than 2GB physical RAM + some swap.
> > > 
> > > I simulated this stuff using this PBS script:
> > > 
> > > #!/bin/sh
> > > for i in `seq 1000`; do
> > >     mpiexec -comm none /bin/true
> > > done
> > > 
> > > After a clean restart, the virtual memory usage of pbs_mom grew to about 
> > > 390MB on the affected nodes. This is a little [tm] too much, I guess. The 
> > > same happened if I used pbsdsh instead of mpiexec.
> > > 
> > > I know I will have to check what happens using valgrind, but I am in a hurry 
> > > with a lot of other things at the moment.
> > > 
> > > Is this all known or even expected behavior? What can be done to make pbs_mom 
> > > behave?
> > > 
> > > Regards,
> > 
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers

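For anyone trying to reproduce the growth Martin describes (about 47MB RSS
against 1.5GB virtual), here is a minimal sketch of a sampler that logs a
process's virtual and resident size while the mpiexec loop runs. It is not
part of Torque. The function names mem_kb and sample_mem are made up for this
example, and it assumes a Linux node where /proc/<pid>/status is readable.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Torque. Linux only (reads /proc).

# mem_kb PID
# Print "VmSize VmRSS" in kB for the given process.
mem_kb() {
    awk '/^VmSize:/ {vsz=$2} /^VmRSS:/ {rss=$2} END {print vsz, rss}' \
        "/proc/$1/status"
}

# sample_mem PID COUNT INTERVAL
# Print COUNT lines of "epoch_seconds VmSize_kB VmRSS_kB",
# sleeping INTERVAL seconds between samples.
sample_mem() {
    pid=$1
    count=$2
    interval=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "$(date +%s) $(mem_kb "$pid")"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

On a compute node, something like sample_mem "$(pgrep -o pbs_mom)" 60 5 while
the test job is running should show whether the first memory column climbs
steadily while the second stays roughly flat, which is the pattern reported
above.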

