[torqueusers] pbs_mom eating memory like a maniac

Martin Schafföner martin.schaffoener at e-technik.uni-magdeburg.de
Thu Nov 10 11:11:22 MST 2005


Maybe I am missing something or have configured something horribly wrong,
but my pbs_moms are slowly eating all available virtual memory.

But let's start slowly. I replaced my old Torque 1.1.0p4 installation with 
2.0.0p0 (and now with 2.0.0p1, but no change). While doing that, I also 
replaced all rsh access to nodes by mpiexec which uses PBS' TM API. So far, 
so good. But now the pbs_moms are slowly but surely eating RAM.

I configured Torque like this:

./configure --enable-clients \
    --prefix=%{_pbs_home} \
    --set-server-home=/var/spool/torque \
    --set-server-name-file=/var/spool/torque/default_server \
    --set-default-server=ko-cluster.et.uni-magdeburg.de \
    --enable-docs --mandir=%{_pbs_home}/man \
    --enable-tcl-qstat \
    --with-tcl \
    --enable-mom \
    --enable-server \
    --enable-syslog \
    --enable-gui \
    --libdir=%{_pbs_home}/%{_lib} \
    --x-libraries=/usr/X11R6/%{_lib}

in an RPM build (just for convenience of installing Torque on the nodes) and 
uncommented the NO_SPOOL_OUTPUT line in server_limits.h. I then built the 
stuff and installed it on the master and on the nodes.

After a while I noticed that jobs could no longer be executed because fork()
failed due to exhausted memory, although plenty of physical RAM was still
available. I checked the processes' memory usage and found pbs_mom at about
47MB RSS and 1.5GB (!!!) of virtual memory. With the nodes set to disallow
memory overcommitment, fork() was bound to fail, because 2 x 1.5GB is far
more than 2GB physical RAM + some swap.
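
As a rough sketch of the arithmetic above (not anything Torque itself
provides), the check below doubles the parent's virtual size and compares it
with a commit limit. The function name fork_fits and the sample numbers are
made up for illustration, and the assumption that the child is charged the
parent's full virtual size is a simplification of the kernel's real
accounting. The second part just prints what a Linux node reports.

```sh
#!/bin/sh
# Simplified model: under strict overcommit, assume fork() needs room for a
# second copy of the parent's whole virtual size (hypothetical helper).
fork_fits() {
    vsz_kb=$1      # virtual size of the parent process, in kB
    limit_kb=$2    # memory the kernel is willing to commit, in kB
    [ $(( vsz_kb * 2 )) -le "$limit_kb" ]
}

# Example numbers: 1.5GB virtual size against roughly 2.5GB of RAM + swap.
if fork_fits 1572864 2621440; then
    echo "fork should fit"
else
    echo "fork would be refused"
fi

# Show the overcommit mode and commit accounting the node reports (Linux only).
if [ -r /proc/sys/vm/overcommit_memory ]; then
    echo "overcommit_memory: $(cat /proc/sys/vm/overcommit_memory)"
    grep -E '^(CommitLimit|Committed_AS):' /proc/meminfo || true
fi
```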

I simulated this stuff using this PBS script:

#!/bin/sh
for i in `seq 1000`; do
    mpiexec -comm none /bin/true
done

After a clean restart of pbs_mom, its virtual memory usage grew to about
390MB on the affected nodes. This is a little [tm] too much, I guess. The
same happened if I used pbsdsh instead of mpiexec.
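
One way to put numbers on this growth is to sample pbs_mom's virtual size
before and after the test job. The sketch below assumes a Linux node with
/proc mounted; the helper names vmsize_kb and growth_kb are made up for
illustration.

```sh
#!/bin/sh
# Print the virtual memory size (VmSize, in kB) of the given PID,
# as reported in /proc/<pid>/status (Linux only).
vmsize_kb() {
    awk '/^VmSize:/ { print $2 }' "/proc/$1/status"
}

# Print the difference between two samples, in kB.
growth_kb() {
    echo $(( $2 - $1 ))
}

# Intended use on a compute node (commented out, since it needs a running
# pbs_mom and a submitted job):
#   mom_pid=$(pgrep -o -x pbs_mom)
#   before=$(vmsize_kb "$mom_pid")
#   ... submit the test script and wait for it to finish ...
#   after=$(vmsize_kb "$mom_pid")
#   growth_kb "$before" "$after"
```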

I know I will have to check what happens using valgrind, but I am in a hurry
with a lot of other things at the moment.

Is this all known or even expected behavior? What can be done to make pbs_mom 
behave?

Regards,
-- 
Martin Schafföner

Cognitive Systems Group, Institute of Electronics, Signal Processing and 
Communication Technologies, Department of Electrical Engineering, 
Otto-von-Guericke University Magdeburg
Phone: +49 391 6720063

