[torqueusers] pbs_mom's child exits immediately after starting pbs_mom

Daniel Andrzejewski andrzeje at cs.utk.edu
Mon Sep 29 09:22:00 MDT 2008


Hi All,

First of all, I tried to look for some information on the web, but couldn't find 
anything related to my problem.

I am upgrading a cluster with 64-bit processors from Debian 3.1 to CentOS 5.2. 
Previously there was no problem running torque 2.1.6. There is no problem 
running torque 2.3.3 on 32-bit CentOS, but there is a problem on 64-bit CentOS.

I simply cannot start pbs_mom. It actually starts, but spawns a child which 
immediately exits, so no pbs_mom is left running on the compute nodes. I decided 
to take one machine, install just torque on it and investigate, but I cannot 
find any logs explaining why pbs_mom's child exits.

The following are the last few lines of 'strace pbs_mom':

fcntl(4, F_SETLK, {type=F_UNLCK, whence=SEEK_SET, start=0, len=0}) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, 
child_tidptr=0x2b9b22f05db0) = 24415
--- SIGCHLD (Child exited) @ 0 (0) ---
exit_group(0)                           = ?
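
The trace above only follows the parent process, so it does not show what the 
child did before it exited. A minimal sketch of tracing the child as well, 
assuming strace's -f (follow forks) and -o (write to a file) options; 
trace_with_children is a hypothetical helper name and the output path is an 
arbitrary choice:

```shell
# Hypothetical helper: trace a command together with every process it forks.
#   $1     file that receives the trace
#   $2...  command to run under strace
trace_with_children() {
    out=$1
    shift
    # -f follows children created by fork/clone; -o writes the trace to a file
    strace -f -o "$out" "$@"
}

# Example (not run here): trace_with_children /tmp/pbs_mom.trace pbs_mom
```

With -f, each line in the trace file is prefixed with the PID that made the 
call, so the child's last calls before exiting can be read from the file.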


I went to the troubleshooting section of the torque documentation, 10.1.5 Using 
GDB to Locate Failures. When I export the environment variable PBSDEBUG=yes and 
start pbs_mom under gdb, it runs fine and doesn't show any problems:

[root at frodo9 ~]# gdb pbs_mom
GNU gdb Red Hat Linux (6.5-37.el5rh)
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db 
library "/lib64/libthread_db.so.1".

(gdb) run
Starting program: /usr/local/sbin/pbs_mom
MOM is up


The problem with using PBSDEBUG=yes (or the --enable-debug flag at configure 
time) is that pbs_mom doesn't go into the background.
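
A possible stopgap, sketched here only as an illustration and not as a fix for 
the underlying problem: keep PBSDEBUG=yes so pbs_mom stays in the foreground, 
and let the shell put it in the background instead. run_debug_bg and the log 
path are hypothetical names chosen for this example:

```shell
# Hypothetical helper: run a command with PBSDEBUG=yes, detached by the shell.
#   $1     log file for stdout and stderr
#   $2...  command to run
run_debug_bg() {
    log=$1
    shift
    # nohup keeps the process alive after logout; '&' backgrounds it
    PBSDEBUG=yes nohup "$@" >"$log" 2>&1 &
    echo $!   # print the PID of the background job
}

# Example (not run here): run_debug_bg /tmp/pbs_mom.debug.log pbs_mom
```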

These are the options I use when configuring torque:

./configure --prefix=/pkgs/torque-2.1.6 \
             --with-server-home=/sw/var/torque \
             --with-pam=/lib64/security \
             --with-scp \
             --with-default-server=frodo9.sinrg.local \
             --with-sendmail=/usr/sbin/sendmail \
             --enable-syslog \
             --disable-rpp \
             --disable-gui \
             --disable-gcc-warnings


Maybe the following information will help. I exported PBSDEBUG=yes, started 
pbs_mom in one window, and ran some diagnostics in another:

[root at frodo9 ~]# pbsnodes
frodo9.sinrg.local
      state = free
      np = 2
      ntype = cluster
      status = opsys=linux,uname=Linux frodo9 2.6.18-92.el5 #1 SMP Tue Jun 10 18:51:06 EDT 2008 x86_64,sessions=23738,nsessions=1,nusers=1,idletime=54,totmem=4156776kb,availmem=4001240kb,physmem=2059632kb,ncpus=2,loadave=0.00,gres=server:frodo9.sinrg.local,netload=59948748,state=free,jobs=? 0,rectime=1222697670

[root at frodo9 ~]# momctl -d 3

Host: frodo9/frodo9.sinrg.local   Version: 2.1.6
Server[0]: frodo9 (172.16.0.9)
   Init Msgs Received:     1 hellos/1 cluster-addrs
   Init Msgs Sent:         70 hellos
   Last Msg From Server:   61 seconds (CLUSTER_ADDRS)
   Last Msg To Server:     15 seconds
PID:                    24809
HomeDirectory:          /sw/var/torque/mom_priv
MOM active:             145 seconds
Server Update Interval: 45 seconds
LOGLEVEL:               0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model:    TCP
NOTE:  no prolog configured
Alarm Time:             0 of 10 seconds
Trusted Client List:    172.16.0.9,127.0.0.1
Configured to use /usr/bin/scp -rpB
NOTE:  no local jobs detected

diagnostics complete


The output above is from torque 2.1.6, but that shouldn't matter since 2.3.3 
behaves the same way.
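
The status line in the pbsnodes output is a comma-separated list of key=value 
pairs. A small sketch for pulling out a single field, assuming that format 
holds; status_field is a hypothetical helper name:

```shell
# Hypothetical helper: print the value of one key from a comma-separated
# key=value status string read on stdin.
status_field() {
    # put each key=value pair on its own line, then print the matching value
    tr ',' '\n' | sed -n "s/^$1=//p"
}

# Example with part of the status string from the pbsnodes output:
echo 'opsys=linux,ncpus=2,loadave=0.00,state=free' | status_field state
# prints: free
```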

I also tried to compile torque 2.3.3 with 'export CFLAGS=-m32' but this didn't 
fix the problem.

andrzeje:frodo9 /export/src/torque-2.3.3> file /pkgs/torque-2.3.3/sbin/pbs_mom 

/pkgs/torque-2.3.3/sbin/pbs_mom: ELF 32-bit LSB executable, Intel 80386, version 
1 (SYSV), for GNU/Linux 2.6.9, dynamically linked (uses shared libs), for 
GNU/Linux 2.6.9, not stripped
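
The file output above shows that the -m32 build did produce a 32-bit binary. 
For checking several binaries the same way, a small sketch; elf_bits is a 
hypothetical helper name, and it only inspects the text that file prints:

```shell
# Hypothetical helper: report whether a line of `file` output describes a
# 32-bit or a 64-bit ELF binary.
#   $1  the text printed by `file` for one binary
elf_bits() {
    case $1 in
        *"ELF 64-bit"*) echo 64 ;;
        *"ELF 32-bit"*) echo 32 ;;
        *)              echo unknown ;;
    esac
}

# Example (not run here): elf_bits "$(file /pkgs/torque-2.3.3/sbin/pbs_mom)"
```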


Please advise!

Thanks,

Daniel
-- 
Daniel Andrzejewski
student IT Administrator
Elec Engr & Comp Science
University of Tennessee
(865) 974 - 4388 (work)

"Investment in knowledge always pays the best interest" Benjamin Franklin