[torquedev] Pbs_mom hangs

Rudolf Gabler rug at usm.lmu.de
Tue Jun 17 06:38:02 MDT 2008


Hi torque developers,

I'm trying to run torque on an 8-node Itanium cluster (Linux 2.6.9-47, CentOS
4.6, HP blade servers). Whenever a job is started on a node, pbs_mom goes
up to 100% CPU; the job runs but then gets stuck in the exiting state.
At that point only a qdel -p (which releases the job but leaves pbs_mom at
100% CPU) or a /etc/init.d/pbs_mom purge (which restores normal behavior,
including the CPU usage) gets rid of the job.

To test this, I set up a single-node, execution-only environment and ran
        strace -etrace=desc -F -f -ff -p pid_of_pbs_mom
before submitting a job. The output is attached to this message (including
that of the forked processes). The overall behavior is that, once the job
starts executing, pbs_mom issues a huge number of "select" system calls,
which drives the process to 100% CPU usage.
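
To put a number on "huge", the trace output could be summarized per system
call. The following is only a minimal sketch, assuming strace's usual line
format ("name(args) = result", optionally prefixed with "[pid N]" when
following forks); the names count_syscalls and SYSCALL_RE are made up for
this illustration and are not part of torque or strace.

```python
import re
import sys
from collections import Counter

# Assumed strace line format: "select(8, [3 4], ...) = 1", or with -f
# "[pid  1234] select(...)". This is an assumption about the trace layout,
# not something taken from the attachment.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?(\w+)\(")


def count_syscalls(lines):
    """Count how often each system call name starts a trace line.

    Lines that do not start a call (signals, "<... resumed>" continuations)
    are skipped.
    """
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical usage: python count_syscalls.py trace_file [trace_file ...]
    totals = Counter()
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            totals.update(count_syscalls(handle))
    # Print the ten most frequent calls, most frequent first.
    for name, n in totals.most_common(10):
        print(f"{n:8d} {name}")
```

If select dominates the counts by a wide margin while the job is running,
that would match the busy loop described above.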

I tested versions from torque-2.0pl11 up to torque-2.3.0-snap.200801151629.

Can anyone help me?

Rudolf Gabler 

email:rug_at_usm.lmu.de 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: pbs_mom.tar.gz
Type: video/x-flv
Size: 11910 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torquedev/attachments/20080617/b077aba0/pbs_mom.tar.bin

