[torqueusers] pbs_mom eating memory like a maniac
Martin Schafföner
martin.schaffoener at e-technik.uni-magdeburg.de
Thu Nov 10 11:11:22 MST 2005
I may be missing something, or I may have configured something horribly
wrong, but my pbs_moms are slowly eating all available virtual memory.
But let's start at the beginning. I replaced my old Torque 1.1.0p4
installation with 2.0.0p0 (and now with 2.0.0p1, which made no difference).
While doing that, I also replaced all rsh access to the nodes with mpiexec,
which uses PBS's TM API. So far, so good. But now the pbs_moms are slowly but
surely eating RAM.
I configured Torque like this:
./configure --enable-clients \
--prefix=%{_pbs_home} \
--set-server-home=/var/spool/torque \
--set-server-name-file=/var/spool/torque/default_server \
--set-default-server=ko-cluster.et.uni-magdeburg.de \
--enable-docs --mandir=%{_pbs_home}/man \
--enable-tcl-qstat \
--with-tcl \
--enable-mom \
--enable-server \
--enable-syslog \
--enable-gui \
--libdir=%{_pbs_home}/%{_lib} \
--x-libraries=/usr/X11R6/%{_lib}
in an RPM build (just for convenience of installing Torque on the nodes) and
uncommented the NO_SPOOL_OUTPUT line in server_limits.h. I then built the
stuff and installed it on the master and on the nodes.
After a while I noticed that jobs could no longer be executed because fork()
failed due to exhausted memory, although physically lots of RAM was
available. I checked processes' memory usage and found pbs_mom having about
47MB RSS and 1.5GB (!!!) of virtual memory. With the nodes set to not allow
memory overcommitment, fork() was bound to fail: the child would need another
1.5GB committed, and 2 x 1.5GB is far more than 2GB physical RAM + some swap.
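A check like the one described above could be sketched as follows. This is only a minimal sketch, assuming a Linux node with /proc mounted and pgrep available; the function names proc_mem and report_pbs_mom are made up for illustration and are not part of Torque.

```shell
#!/bin/sh
# Print virtual size and resident size (in kB) of one process,
# read from /proc/<pid>/status (Linux only).
proc_mem() {
    awk '/^VmSize:/ {vsz=$2} /^VmRSS:/ {rss=$2}
         END {print "VmSize=" vsz "kB VmRSS=" rss "kB"}' "/proc/$1/status"
}

# Hypothetical helper: report every running pbs_mom plus the
# kernel's overcommit settings.
report_pbs_mom() {
    # pgrep -x matches the exact process name
    for pid in $(pgrep -x pbs_mom); do
        echo "pbs_mom pid $pid: $(proc_mem "$pid")"
    done
    # 0 = heuristic, 1 = always overcommit, 2 = strict accounting
    echo "vm.overcommit_memory = $(cat /proc/sys/vm/overcommit_memory)"
    # With strict accounting, fork() fails once Committed_AS would
    # exceed CommitLimit.
    awk '/^(CommitLimit|Committed_AS):/' /proc/meminfo
}

report_pbs_mom
```

A large gap between VmSize and VmRSS, together with overcommit mode 2, would match the fork() failures described here.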
I simulated this stuff using this PBS script:
#!/bin/sh
for i in `seq 1000`; do
    mpiexec -comm none /bin/true
done
Starting from a freshly restarted pbs_mom, its virtual memory usage grew to
about 390MB on the affected nodes. This is a little [tm] too much, I guess.
The same happened if I used pbsdsh instead of mpiexec.
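To see the growth while such a test job runs, something like the following could be left running on a node. Again, this is just a sketch assuming Linux /proc; vsz_kb and sample_vsz are illustrative names, and the interval and sample count are arbitrary.

```shell
#!/bin/sh
# Virtual size of a process in kB, from /proc/<pid>/status (Linux only).
vsz_kb() {
    awk '/^VmSize:/ {print $2}' "/proc/$1/status"
}

# Hypothetical helper: print "<epoch seconds> <VmSize in kB>" for PID
# every INTERVAL seconds, COUNT times in total.
sample_vsz() {
    pid=$1
    interval=$2
    count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "$(date +%s) $(vsz_kb "$pid")"
        i=$((i + 1))
        # Do not sleep after the final sample
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Example call (not run here): one sample every 10 seconds, 60 samples,
# for the oldest pbs_mom process:
#   sample_vsz "$(pgrep -o -x pbs_mom)" 10 60 > pbs_mom_vsz.log
```

A second column that keeps rising while the loop of mpiexec calls runs would show the leak per spawned task without needing valgrind yet.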
I know I will have to check what happens using valgrind, but I am in a hurry
with a lot of other things at the moment.
Is this all known or even expected behavior? What can be done to make pbs_mom
behave?
Regards,
--
Martin Schafföner
Cognitive Systems Group, Institute of Electronics, Signal Processing and
Communication Technologies, Department of Electrical Engineering,
Otto-von-Guericke University Magdeburg
Phone: +49 391 6720063