[torqueusers] pbs_mom eating memory like a maniac
Dave Jackson
jacksond at clusterresources.com
Thu Nov 10 14:30:51 MST 2005
Garrick,
You are awesome! Let me know and we'll roll it in and push out a
build.
Dave
On Thu, 2005-11-10 at 13:21 -0800, Garrick Staples wrote:
> It's in check_pwd(). I'll have a patch in a few minutes.
>
>
> On Thu, Nov 10, 2005 at 11:52:31AM -0700, Dave Jackson alleged:
> > Martin,
> >
> > The memory leak is definitely not expected behavior. Valgrind can
> > probably isolate the issue quickly. So to confirm, you run a single job
> > with repeat mpiexec calls inside and get the memory leak. Correct?
> >
> > Get us the valgrind output as soon as possible and we will see what we
> > can do.
> >
> > Dave
> >
> > On Thu, 2005-11-10 at 19:11 +0100, Martin Schafföner wrote:
> > > It may be I am missing something or that I configured something horribly
> > > wrong, but my pbs_moms are slowly eating all available virtual memory.
> > >
> > > But let's start slowly. I replaced my old Torque 1.1.0p4 installation with
> > > 2.0.0p0 (and now with 2.0.0p1, but no change). While doing that, I also
> > > replaced all rsh access to nodes by mpiexec which uses PBS' TM API. So far,
> > > so good. But now the pbs_moms are slowly but surely eating RAM.
> > >
> > > I configured Torque like this:
> > >
> > > ./configure --enable-clients \
> > > --prefix=%{_pbs_home} \
> > > --set-server-home=/var/spool/torque \
> > > --set-server-name-file=/var/spool/torque/default_server \
> > > --set-default-server=ko-cluster.et.uni-magdeburg.de \
> > > --enable-docs --mandir=%{_pbs_home}/man \
> > > --enable-tcl-qstat \
> > > --with-tcl \
> > > --enable-mom \
> > > --enable-server \
> > > --enable-syslog \
> > > --enable-gui \
> > > --libdir=%{_pbs_home}/%{_lib} \
> > > --x-libraries=/usr/X11R6/%{_lib}
> > >
> > > in an RPM build (just for convenience of installing Torque on the nodes) and
> > > uncommented the NO_SPOOL_OUTPUT line in server_limits.h. I then built the
> > > stuff and installed it on the master and on the nodes.
> > >
> > > After a while I noticed that jobs could no longer be executed because fork()
> > > failed due to exhausted memory, although physically lots of RAM was
> > > available. I checked processes' memory usage and found pbs_mom having about
> > > 47MB RSS and 1.5GB (!!!) virtual memory. With the nodes set to not allow
> > > overcommitment of memory, fork surely wouldn't work because 2 x 1.5GB was far
> > > more than 2GB physical RAM + some swap.
> > >
> > > I simulated this stuff using this PBS script:
> > >
> > > #!/bin/sh
> > > for i in `seq 1000`; do
> > >     mpiexec -comm none /bin/true
> > > done
> > >
> > > After a clean restart, the virtual memory usage of pbs_mom grew to about
> > > 390MB on the affected nodes. This is a little [tm] too much, I guess. The
> > > same happened if I used pbsdsh instead of mpiexec.
> > >
> > > I know I will have to check what happens using valgrind, but I am in a hurry
> > > with a lot of other things at the moment.
> > >
> > > Is this all known or even expected behavior? What can be done to make pbs_mom
> > > behave?
> > >
> > > Regards,
> >
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
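
To put numbers on the growth described in the original report, one could sample pbs_mom's virtual and resident size while the test job runs. This is a minimal sketch, assuming a Linux /proc filesystem; the helper name vm_kb and the MOM_PID variable are hypothetical and not part of Torque.

```shell
#!/bin/sh
# Print the value (in kB) of one field, e.g. VmSize or VmRSS, from a
# Linux /proc/<pid>/status file.
# $1 = field name, $2 = path to the status file
vm_kb() {
    awk -v key="$1:" '$1 == key { print $2 }' "$2"
}

# Assumption: MOM_PID is set by the caller to the PID of the running pbs_mom.
# The loop only runs when MOM_PID is set, and stops when the process exits.
if [ -n "${MOM_PID:-}" ]; then
    while kill -0 "$MOM_PID" 2>/dev/null; do
        status="/proc/$MOM_PID/status"
        echo "$(date +%s) VmSize=$(vm_kb VmSize "$status")kB VmRSS=$(vm_kb VmRSS "$status")kB"
        sleep 5
    done
fi
```

Running this on a node alongside the 1000-iteration mpiexec job would show whether VmSize climbs steadily while VmRSS stays small, which is the pattern reported above (about 47MB RSS against 1.5GB virtual).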