[torqueusers] pbs_mom's child exits immediately after starting pbs_mom

Daniel Andrzejewski andrzeje at cs.utk.edu
Tue Sep 30 08:52:39 MDT 2008


I would like to add some output from 'strace -f pbs_mom', where the -f flag 
makes strace follow the child processes as well.

[root@frodo9 torque-2.3.3]# strace -f pbs_mom
.
.
.
bind(6, {sa_family=AF_INET, sin_port=htons(15002), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
time(NULL)                              = 1222785330
listen(6, 512)                          = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 7
setsockopt(7, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(7, {sa_family=AF_INET, sin_port=htons(15003), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
time(NULL)                              = 1222785330
listen(7, 512)                          = 0
fcntl(5, F_SETLK, {type=F_UNLCK, whence=SEEK_SET, start=0, len=0}) = 0
clone(Process 18441 attached
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b276258fdb0) = 18441
[pid 18441] getsockname(3, {sa_family=AF_INET, sin_port=htons(56711), sin_addr=inet_addr("172.16.0.9")}, [16]) = 0
[pid 18441] getpeername(3, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("172.16.2.24")}, [68719476752]) = 0
[pid 18441] fcntl(3, F_GETFD)           = 0x1 (flags FD_CLOEXEC)
[pid 18441] dup(3)                      = 8
[pid 18441] fcntl(8, F_SETFD, FD_CLOEXEC) = 0
[pid 18441] socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 9
[pid 18441] close(3)                    = 0
[pid 18441] fcntl(9, F_GETFD)           = 0
[pid 18441] dup2(9, 3)                  = 3
[pid 18441] fcntl(3, F_SETFD, 0)        = 0
[pid 18441] close(9)                    = 0
[pid 18441] write(3, "\25\3\1\0 \232\17\205\301f<O0\352\246\357\344Z\31&\243\361\356\2128\242\377\7O{\267\333"..., 37 <unfinished ...>
[pid 18440] exit_group(0)               = ?
[pid 18441] <... write resumed> )       = -1 EPIPE (Broken pipe)
[pid 18441] --- SIGPIPE (Broken pipe) @ 0 (0) ---
Process 18441 detached
[root@frodo9 torque-2.3.3]#
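
One possible reading of the trace above, offered as a guess rather than a
diagnosis: port 389 is LDAP, the bytes \25\3\1 look like a TLS alert record
header, and the child replaces fd 3 with a fresh, unconnected socket (socket,
close, dup2) before writing to it. That write fails with EPIPE, SIGPIPE is
raised, and with the default disposition the signal terminates the child.
The Python sketch below reproduces only that last step. It is not TORQUE or
LDAP code: the function name write_to_dead_socket is made up for
illustration, and a socketpair with its peer closed stands in for the
unconnected TCP socket.

```python
import os
import signal
import socket


def write_to_dead_socket(ignore_sigpipe):
    """Fork a child that writes to a socket with no peer; return its wait status.

    Hypothetical helper for illustration only.
    """
    ours, peer = socket.socketpair()
    peer.close()  # no reader left: any write on `ours` now fails with EPIPE

    pid = os.fork()
    if pid == 0:
        # Child. Python ignores SIGPIPE at startup, so set the disposition
        # explicitly to model either a plain C program or one that ignores it.
        handler = signal.SIG_IGN if ignore_sigpipe else signal.SIG_DFL
        signal.signal(signal.SIGPIPE, handler)
        code = 1
        try:
            os.write(ours.fileno(), b"\x15\x03\x01")
            code = 0   # write unexpectedly succeeded
        except BrokenPipeError:
            code = 42  # SIGPIPE ignored: write() just reports EPIPE
        finally:
            os._exit(code)

    ours.close()
    _, status = os.waitpid(pid, 0)
    return status


killed = write_to_dead_socket(ignore_sigpipe=False)
survived = write_to_dead_socket(ignore_sigpipe=True)
print("default SIGPIPE: child killed by signal", os.WTERMSIG(killed))
print("ignored SIGPIPE: child exit code", os.WEXITSTATUS(survived))
```

Whether this is what actually kills the pbs_mom child here is an assumption;
the sketch only shows that the last strace lines are consistent with it.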


Any help is appreciated.

Daniel

-- 
Daniel Andrzejewski
student IT Administrator
Elec Engr & Comp Science
University of Tennessee
(865) 974 - 4388 (work)

"Investment in knowledge always pays the best interest" Benjamin Franklin
-- 

Daniel Andrzejewski wrote:
> Hi All,
> 
> First of all, I tried to look for some information on the web, but 
> couldn't find anything related to my problem.
> 
> I am upgrading from Debian 3.1 to CentOS 5.2 on a cluster with 64-bit 
> processors. There was no problem running torque 2.1.6. There is no 
> problem running torque 2.3.3 on 32-bit CentOS, but there is a problem 
> on 64-bit CentOS.
> 
> I simply cannot start pbs_mom. It actually starts, but spawns a child 
> which immediately exits, so no pbs_mom is left running on the compute 
> nodes. I decided to take one machine, install just torque on it and 
> investigate, but I cannot find any log that explains why pbs_mom's 
> child exits.
> 
> The following are the last few lines of 'strace pbs_mom':
> 
> fcntl(4, F_SETLK, {type=F_UNLCK, whence=SEEK_SET, start=0, len=0}) = 0
> clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b9b22f05db0) = 24415
> --- SIGCHLD (Child exited) @ 0 (0) ---
> exit_group(0)                           = ?
> 
> 
> I went to the troubleshooting section of the torque documentation 
> (10.1.5 Using GDB to Locate Failures). When I export the environment 
> variable PBSDEBUG=yes and start pbs_mom under gdb, it runs fine and 
> doesn't show any problems:
> 
> [root@frodo9 ~]# gdb pbs_mom
> GNU gdb Red Hat Linux (6.5-37.el5rh)
> Copyright (C) 2006 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu"...Using host libthread_db library "/lib64/libthread_db.so.1".
> 
> (gdb) run
> Starting program: /usr/local/sbin/pbs_mom
> MOM is up
> 
> 
> The problem with using PBSDEBUG=yes (or the --enable-debug flag at 
> configure time) is that pbs_mom doesn't go into the background.
> 
> These are the options I use while configuring torque:
> 
> ./configure --prefix=/pkgs/torque-2.1.6
>             --with-server-home=/sw/var/torque
>             --with-pam=/lib64/security
>             --with-scp
>             --with-default-server=frodo9.sinrg.local
>             --with-sendmail=/usr/sbin/sendmail
>             --enable-syslog
>             --disable-rpp
>             --disable-gui
>             --disable-gcc-warnings
> 
> 
> Maybe the following information could help. I exported PBSDEBUG=yes, 
> started pbs_mom in one window, and ran some diagnostics in another 
> window:
> 
> [root@frodo9 ~]# pbsnodes
> frodo9.sinrg.local
>      state = free
>      np = 2
>      ntype = cluster
>      status = opsys=linux,uname=Linux frodo9 2.6.18-92.el5 #1 SMP Tue Jun 10 18:51:06 EDT 2008 x86_64,sessions=23738,nsessions=1,nusers=1,idletime=54,totmem=4156776kb,availmem=4001240kb,physmem=2059632kb,ncpus=2,loadave=0.00,gres=server:frodo9.sinrg.local,netload=59948748,state=free,jobs=? 0,rectime=1222697670
> 
> [root@frodo9 ~]# momctl -d 3
> 
> Host: frodo9/frodo9.sinrg.local   Version: 2.1.6
> Server[0]: frodo9 (172.16.0.9)
>   Init Msgs Received:     1 hellos/1 cluster-addrs
>   Init Msgs Sent:         70 hellos
>   Last Msg From Server:   61 seconds (CLUSTER_ADDRS)
>   Last Msg To Server:     15 seconds
> PID:                    24809
> HomeDirectory:          /sw/var/torque/mom_priv
> MOM active:             145 seconds
> Server Update Interval: 45 seconds
> LOGLEVEL:               0 (use SIGUSR1/SIGUSR2 to adjust)
> Communication Model:    TCP
> NOTE:  no prolog configured
> Alarm Time:             0 of 10 seconds
> Trusted Client List:    172.16.0.9,127.0.0.1
> Configured to use /usr/bin/scp -rpB
> NOTE:  no local jobs detected
> 
> diagnostics complete
> 
> 
> The output above is from torque 2.1.6, but that shouldn't matter since 
> 2.3.3 behaves the same way.
> 
> I also tried to compile torque 2.3.3 with 'export CFLAGS=-m32' but this 
> didn't fix the problem.
> 
> andrzeje:frodo9 /export/src/torque-2.3.3> file /pkgs/torque-2.3.3/sbin/pbs_mom
> /pkgs/torque-2.3.3/sbin/pbs_mom: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.6.9, dynamically linked (uses shared libs), for GNU/Linux 2.6.9, not stripped
> 
> 
> Please advise!
> 
> Thanks,
> 
> Daniel

