[torqueusers] pbs_mom eating memory like a maniac
Garrick Staples
garrick at usc.edu
Thu Nov 10 14:21:49 MST 2005
It's in check_pwd(). I'll have a patch in a few minutes.
On Thu, Nov 10, 2005 at 11:52:31AM -0700, Dave Jackson alleged:
> Martin,
>
> The memory leak is definitely not expected behavior. Valgrind can
> probably isolate the issue quickly. So to confirm, you run a single job
> with repeat mpiexec calls inside and get the memory leak. Correct?
>
> Get us the valgrind output as soon as possible and we will see what we
> can do.
>
> Dave
>
> On Thu, 2005-11-10 at 19:11 +0100, Martin Schafföner wrote:
> > It may be I am missing something or that I configured something horribly
> > wrong, but my pbs_moms are slowly eating all available virtual memory.
> >
> > But let's start slowly. I replaced my old Torque 1.1.0p4 installation with
> > 2.0.0p0 (and now with 2.0.0p1, but no change). While doing that, I also
> > replaced all rsh access to nodes by mpiexec which uses PBS' TM API. So far,
> > so good. But now the pbs_moms are slowly but surely eating RAM.
> >
> > I configured Torque like this:
> >
> > ./configure --enable-clients \
> > --prefix=%{_pbs_home} \
> > --set-server-home=/var/spool/torque \
> > --set-server-name-file=/var/spool/torque/default_server \
> > --set-default-server=ko-cluster.et.uni-magdeburg.de \
> > --enable-docs --mandir=%{_pbs_home}/man \
> > --enable-tcl-qstat \
> > --with-tcl \
> > --enable-mom \
> > --enable-server \
> > --enable-syslog \
> > --enable-gui \
> > --libdir=%{_pbs_home}/%{_lib} \
> > --x-libraries=/usr/X11R6/%{_lib}
> >
> > in an RPM build (just for convenience of installing Torque on the nodes) and
> > uncommented the NO_SPOOL_OUTPUT line in server_limits.h. I then built the
> > stuff and installed it on the master and on the nodes.
> >
> > After a while I noticed that jobs could no longer be executed because fork()
> > failed due to exhausted memory, although physically lots of RAM was
> > available. I checked processes' memory usage and found pbs_mom having about
> > 47MB RSS and 1.5GB (!!!) virtual memory. With the nodes set to not allow
> > overcommitment of memory, fork surely wouldn't work because 2 x 1.5GB was far
> > more than 2GB physical RAM + some swap.
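The arithmetic above can be checked on a node. This is only a minimal sketch, assuming a Linux kernel that reports CommitLimit and Committed_AS in /proc/meminfo and strict overcommit accounting (vm.overcommit_memory=2). The helper names commit_headroom_kb and would_fit are made up for illustration, and the kernel's real accounting at fork() time is more detailed than a single subtraction.

```shell
#!/bin/sh
# Hypothetical helper: print the remaining commit headroom in kB,
# read from a meminfo-format file ($1, default /proc/meminfo).
commit_headroom_kb() {
    awk '/^CommitLimit:/  {limit = $2}
         /^Committed_AS:/ {used = $2}
         END {print limit - used}' "${1:-/proc/meminfo}"
}

# Hypothetical helper: succeed if a further $1 kB of commit would
# still fit under the limit (rough approximation only).
would_fit() {
    [ "$(commit_headroom_kb "$2")" -ge "$1" ]
}

# Example: would a second copy of a 1.5GB pbs_mom fit right now?
#   would_fit 1572864 && echo "fits" || echo "fork would likely fail"
```

If the headroom is smaller than the virtual size of pbs_mom, a failing fork() is what one would expect even with plenty of free physical RAM.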
> >
> > I simulated this stuff using this PBS script:
> >
> > #!/bin/sh
> > for i in `seq 1000`; do
> > mpiexec -comm none /bin/true
> > done
> >
> > From a clean restart of pbs_mom virtual memory usage of pbs_mom grew to about
> > 390MB on the affected nodes. This is a little [tm] too much, I guess. The
> > same happened if I used pbsdsh instead of mpiexec.
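The growth described above can be watched while the loop runs by sampling the VmSize and VmRSS lines of pbs_mom's status file. A minimal sketch, assuming Linux /proc and a single pbs_mom process per node; the helper name vm_fields is hypothetical and not part of Torque.

```shell
#!/bin/sh
# Hypothetical helper: print "VmSize_kB VmRSS_kB" taken from a
# /proc/<pid>/status-format file given as $1.
vm_fields() {
    awk '/^VmSize:/ {size = $2}
         /^VmRSS:/  {rss = $2}
         END {print size, rss}' "$1"
}

# Example use (assumes pgrep is installed and finds exactly one pbs_mom):
#   pid=$(pgrep -o -x pbs_mom)
#   while sleep 10; do vm_fields "/proc/$pid/status"; done
```

A VmSize column that keeps climbing while VmRSS stays roughly flat would match the 47MB RSS versus 1.5GB virtual picture reported earlier in this message.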
> >
> > I know I will have to check what happens using valgrind, but I am in a hurry
> > with a lot of other things at the moment.
> >
> > Is this all known or even expected behavior? What can be done to make pbs_mom
> > behave?
> >
> > Regards,
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
--
Garrick Staples, Linux/HPCC Administrator
University of Southern California