[torqueusers] pbs_mom eating memory like a maniac

Dave Jackson jacksond at clusterresources.com
Thu Nov 10 11:52:31 MST 2005


Martin,

  The memory leak is definitely not expected behavior.  Valgrind can
probably isolate the issue quickly.  So to confirm: you run a single job
with repeated mpiexec calls inside it and see the memory leak.  Correct?
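
  To watch the growth directly, the mom's address space can be sampled
from /proc between iterations. A minimal sketch (Linux /proc assumed;
the pid would be pbs_mom's, shown here against the shell's own pid for
illustration):

```shell
#!/bin/sh
# Sketch: print the virtual size (VmSize) and resident set (VmRSS), in kB,
# of a process from /proc so growth can be sampled between mpiexec runs.
# Defaults to this shell's own pid; pass pbs_mom's pid as $1 in practice.
pid=${1:-$$}
awk '/^VmSize:|^VmRSS:/ {print $1, $2, $3}' "/proc/$pid/status"
```

Running it repeatedly against pbs_mom while the loop job executes should
show whether VmSize climbs with each mpiexec/pbsdsh invocation.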

  Get us the valgrind output as soon as possible and we will see what we
can do.
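
  For the valgrind run, something along these lines should work; the
install path and the foreground flag are assumptions and may differ on
your build:

```shell
#!/bin/sh
# Sketch: run pbs_mom in the foreground under valgrind's memcheck tool.
# /usr/local/sbin/pbs_mom is an assumed install path; -D is assumed to
# keep the daemon in the foreground so valgrind can track it.
valgrind --tool=memcheck --leak-check=full \
         --log-file=/tmp/pbs_mom.vg \
         /usr/local/sbin/pbs_mom -D
# After running the loop job and shutting the mom down, /tmp/pbs_mom.vg
# should list the "definitely lost" allocation stacks.
```

Even a single pass of the 1000-iteration job under valgrind should make
the leaking call path obvious in the leak summary.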

Dave

On Thu, 2005-11-10 at 19:11 +0100, Martin Schafföner wrote:
> It may be that I am missing something, or that I configured something horribly 
> wrong, but my pbs_moms are slowly eating all available virtual memory.
> 
> But let's start slowly. I replaced my old Torque 1.1.0p4 installation with 
> 2.0.0p0 (and now with 2.0.0p1, but no change). While doing that, I also 
> replaced all rsh access to nodes by mpiexec which uses PBS' TM API. So far, 
> so good. But now the pbs_moms are slowly but surely eating RAM.
> 
> I configured Torque like this:
> 
> ./configure 	--enable-clients \
> 			--prefix=%{_pbs_home} \
> 	 		--set-server-home=/var/spool/torque  \
> 			--set-server-name-file=/var/spool/torque/default_server    \
> 			--set-default-server=ko-cluster.et.uni-magdeburg.de \
> 			--enable-docs --mandir=%{_pbs_home}/man \
> 			--enable-tcl-qstat \
> 			--with-tcl \
>         		--enable-mom \
>         		--enable-server \
> 			--enable-syslog \
> 			--enable-gui \
> 			--libdir=%{_pbs_home}/%{_lib} \
> 			--x-libraries=/usr/X11R6/%{_lib}
> 
> in an RPM build (just for convenience of installing Torque on the nodes) and 
> uncommented the NO_SPOOL_OUTPUT line in server_limits.h. I then built the 
> stuff and installed it on the master and on the nodes.
> 
> After a while I noticed that jobs could no longer be executed because fork() 
> failed due to exhausted memory, although lots of physical RAM was still 
> available. I checked the processes' memory usage and found pbs_mom at about 
> 47MB RSS and 1.5GB (!!!) of virtual memory. With the nodes set to disallow 
> memory overcommitment, fork() could not succeed: duplicating the address space 
> momentarily needs 2 x 1.5GB of commit, far more than 2GB physical RAM + some swap.
> 
> I simulated this stuff using this PBS script:
> 
> #!/bin/sh
> for i in `seq 1000`; do
>     mpiexec -comm none /bin/true
> done
> 
> From a clean restart of pbs_mom, its virtual memory usage grew to about 
> 390MB on the affected nodes. This is a little [tm] too much, I guess. The 
> same happened if I used pbsdsh instead of mpiexec.
> 
> I know I will have to check what happens using valgrind, but I am in a hurry 
> with a lot of other things at the moment.
> 
> Is this all known or even expected behavior? What can be done to make pbs_mom 
> behave?
> 
> Regards,


