[torqueusers] pbs_mom's child exits immediately after starting pbs_mom

Daniel Andrzejewski andrzeje at cs.utk.edu
Tue Sep 30 21:15:40 MDT 2008


Hi Daniel,

Thanks for your suggestion. It did not solve my problem, but maybe I can 
learn something from the output of strace with -f:

[root@frodo9 torque-2.3.2-snap.200807081528]# strace -f pbs_mom
.
.
.
fcntl(4, F_SETLK, {type=F_UNLCK, whence=SEEK_SET, start=0, len=0}) = 0
clone(Process 26346 attached
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2aae326a1db0) = 26346
[pid 26345] exit_group(0)               = ?
getsockname(3, 0x7fff7866e9e0, [128])   = -1 ENOTSOCK (Socket operation on non-socket)
fcntl(3, F_GETFD)                       = 0
dup(3)                                  = 7
fcntl(7, F_SETFD, 0)                    = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 8
close(3)                                = 0
fcntl(8, F_GETFD)                       = 0
dup2(8, 3)                              = 3
fcntl(3, F_SETFD, 0)                    = 0
close(8)                                = 0
write(3, "\25\3\1\0 \307\276jJ\207v\7\3473A\355\16\340\232\347\35\\\311\307\3472i)\5L\t\22"..., 37) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
Process 26346 detached
[root@frodo9 torque-2.3.2-snap.200807081528]#
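
The trace ends with the child (pid 26346) receiving SIGPIPE right after a write() on a freshly created, never connected TCP socket. The default action for SIGPIPE is to terminate the process, which would explain why the child disappears without logging anything. Here is a minimal sketch of that mechanism in Python. It assumes Linux behaviour (EPIPE plus SIGPIPE on such a write, as in the trace); CHILD and run_child are names made up for this illustration and have nothing to do with the torque sources.

```python
import signal
import subprocess
import sys

# Child program: restore the default SIGPIPE action (Python ignores SIGPIPE
# at startup), then write to a TCP socket that was never connected.
CHILD = r"""
import os, signal, socket
signal.signal(signal.SIGPIPE, signal.SIG_DFL)
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
os.write(s.fileno(), b"x")   # on Linux: EPIPE and SIGPIPE, as in the trace
print("not reached")
"""


def run_child():
    """Run CHILD in a separate interpreter and return (returncode, stdout)."""
    proc = subprocess.run([sys.executable, "-c", CHILD],
                          stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return proc.returncode, proc.stdout


if __name__ == "__main__":
    rc, out = run_child()
    # subprocess reports "killed by signal N" as return code -N.
    print("return code:", rc)
    print("killed by SIGPIPE:", rc == -signal.SIGPIPE)
```

If the child is killed this way, nothing after the write() runs, so no log message can be produced by the child itself.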

This line may be the clue:
getsockname(3, 0x7fff7866e9e0, [128])   = -1 ENOTSOCK (Socket operation on non-socket)
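
ENOTSOCK here means that descriptor 3 is open but does not refer to a socket at the moment of the call. A small sketch that reproduces the same error, assuming a Linux-like libc reachable through ctypes; getsockname_errno is a hypothetical helper written for this illustration, and a pipe end stands in for whatever fd 3 really is in pbs_mom:

```python
import ctypes
import errno
import os


def getsockname_errno(fd):
    """Call libc getsockname() on fd; return 0 on success, else the errno."""
    libc = ctypes.CDLL(None, use_errno=True)
    addr = ctypes.create_string_buffer(128)   # 128-byte buffer, as in the trace
    addrlen = ctypes.c_uint(len(addr))        # socklen_t, passed by reference
    rc = libc.getsockname(fd, addr, ctypes.byref(addrlen))
    return 0 if rc == 0 else ctypes.get_errno()


if __name__ == "__main__":
    r, w = os.pipe()   # a valid descriptor that is not a socket
    try:
        err = getsockname_errno(r)
        print(errno.errorcode.get(err), "-", os.strerror(err))
    finally:
        os.close(r)
        os.close(w)
```

So the call fails because of what fd 3 points to, not because the descriptor is closed (a closed descriptor would give EBADF instead).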

I will try to investigate it a little bit. Will also try the latest snapshot:
torque-2.4.0-snap.200809251409
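
As a starting point for that investigation, the payload of the failing write() may be worth a look. Its first five bytes, "\25\3\1\0 " in strace's octal notation, have the shape of a TLS record header: content type 21 (alert), version 3.1, length 32, which together with the 5-byte header gives the 37 bytes that write() tried to send. That reading is an assumption based only on the byte values, and it would point at some library holding a TLS connection rather than at pbs_mom itself. A short sketch of the decoding (parse_record_header is a name invented for this example):

```python
import struct

# First five bytes of the write() payload from the trace. strace prints them
# in octal as "\25\3\1\0 "; the trailing space is the byte 0x20.
HEADER = bytes([0o25, 0o3, 0o1, 0o0, 0x20])


def parse_record_header(header):
    """Split a 5-byte TLS-style record header into (type, major, minor, length)."""
    content_type, major, minor, length = struct.unpack("!BBBH", header)
    return content_type, major, minor, length


if __name__ == "__main__":
    ctype, major, minor, length = parse_record_header(HEADER)
    print("content type:", ctype)              # 21 is the TLS alert type
    print("version: %d.%d" % (major, minor))   # 3.1 corresponds to TLS 1.0
    print("payload length:", length)
    print("total record size:", len(HEADER) + length)
```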


Regards,

Daniel Andrzejewski

-- 
Daniel Andrzejewski
student IT Administrator
Elec Engr & Comp Science
University of Tennessee
(865) 974 - 4388 (work)

"Investment in knowledge always pays the best interest"
Benjamin Franklin
-- 

Daniel Bourque wrote:
> I'm running a CentOS 5.2 64bit torque cluster
> I built rpms from the 2.3.2-snap.200807081528 tarball. No problems. 
> Works like a champ.
> 
> Do you get build or runtime errors when trying to do that?
> 
> Daniel Bourque
> Sr. Systems Engineer
> WeatherData Service Inc
> An Accuweather Company
> 
> 
> 
> Daniel Andrzejewski wrote:
>> Hi All,
>>
>> First of all, I tried to look for some information on the web, but 
>> couldn't find anything related to my problem.
>>
>> I am upgrading from Debian 3.1 to CentOS 5.2 on a cluster with 64bit 
>> processors. There was no problem running torque 2.1.6. There is no 
>> problem running torque 2.3.3 on a 32bit CentOS, but there is a problem 
>> with 64bit CentOS.
>>
>> I simply cannot start pbs_mom. It actually starts, but spawns a child 
>> which immediately exits, so there's no pbs_mom on the compute nodes 
>> running. I decided to take one machine, install just torque on it and 
>> investigate, but I cannot find any logs of why pbs_mom's child exits.
>>
>> The following are the last few lines of 'strace pbs_mom':
>>
>> fcntl(4, F_SETLK, {type=F_UNLCK, whence=SEEK_SET, start=0, len=0}) = 0
>> clone(child_stack=0, 
>> flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, 
>> child_tidptr=0x2b9b22f05db0) = 24415
>> --- SIGCHLD (Child exited) @ 0 (0) ---
>> exit_group(0)                           = ?
>>
>>
>> I went to the troubleshooting section of the torque documentation, 10.1.5 
>> Using GDB to Locate Failures. When I export the environment variable 
>> PBSDEBUG=yes and start pbs_mom under gdb, it runs fine and doesn't show 
>> any problems:
>>
>> [root@frodo9 ~]# gdb pbs_mom
>> GNU gdb Red Hat Linux (6.5-37.el5rh)
>> Copyright (C) 2006 Free Software Foundation, Inc.
>> GDB is free software, covered by the GNU General Public License, and 
>> you are
>> welcome to change it and/or distribute copies of it under certain 
>> conditions.
>> Type "show copying" to see the conditions.
>> There is absolutely no warranty for GDB.  Type "show warranty" for 
>> details.
>> This GDB was configured as "x86_64-redhat-linux-gnu"...Using host 
>> libthread_db library "/lib64/libthread_db.so.1".
>>
>> (gdb) run
>> Starting program: /usr/local/sbin/pbs_mom
>> MOM is up
>>
>>
>> The problem with using PBSDEBUG=yes (or the --enable-debug flag at 
>> configure time) is that pbs_mom doesn't go into the background.
>>
>> These are the options I use while configuring torque:
>>
>> ./configure --prefix=/pkgs/torque-2.1.6 \
>>             --with-server-home=/sw/var/torque \
>>             --with-pam=/lib64/security \
>>             --with-scp \
>>             --with-default-server=frodo9.sinrg.local \
>>             --with-sendmail=/usr/sbin/sendmail \
>>             --enable-syslog \
>>             --disable-rpp \
>>             --disable-gui \
>>             --disable-gcc-warnings
>>
>>
>> Maybe the following information could help. I exported PBSDEBUG=yes, 
>> started pbs_mom in one window, and ran some diagnostics in another 
>> window:
>>
>> [root@frodo9 ~]# pbsnodes
>> frodo9.sinrg.local
>>      state = free
>>      np = 2
>>      ntype = cluster
>>      status = opsys=linux,uname=Linux frodo9 2.6.18-92.el5 #1 SMP Tue 
>> Jun 10 18:51:06 EDT 2008 
>> x86_64,sessions=23738,nsessions=1,nusers=1,idletime=54,totmem=4156776kb,availmem=4001240kb,physmem=2059632kb,ncpus=2,loadave=0.00,gres=server:frodo9.sinrg.local,netload=59948748,state=free,jobs=? 
>> 0,rectime=1222697670
>>
>> [root@frodo9 ~]# momctl -d 3
>>
>> Host: frodo9/frodo9.sinrg.local   Version: 2.1.6
>> Server[0]: frodo9 (172.16.0.9)
>>   Init Msgs Received:     1 hellos/1 cluster-addrs
>>   Init Msgs Sent:         70 hellos
>>   Last Msg From Server:   61 seconds (CLUSTER_ADDRS)
>>   Last Msg To Server:     15 seconds
>> PID:                    24809
>> HomeDirectory:          /sw/var/torque/mom_priv
>> MOM active:             145 seconds
>> Server Update Interval: 45 seconds
>> LOGLEVEL:               0 (use SIGUSR1/SIGUSR2 to adjust)
>> Communication Model:    TCP
>> NOTE:  no prolog configured
>> Alarm Time:             0 of 10 seconds
>> Trusted Client List:    172.16.0.9,127.0.0.1
>> Configured to use /usr/bin/scp -rpB
>> NOTE:  no local jobs detected
>>
>> diagnostics complete
>>
>>
>> The above is torque version 2.1.6, but it shouldn't matter since 2.3.3 
>> behaves the same way.
>>
>> I also tried to compile torque 2.3.3 with 'export CFLAGS=-m32' but 
>> this didn't fix the problem.
>>
>> andrzeje:frodo9 /export/src/torque-2.3.3> file /pkgs/torque-2.3.3/sbin/pbs_mom
>> /pkgs/torque-2.3.3/sbin/pbs_mom: ELF 32-bit LSB executable, Intel 
>> 80386, version 1 (SYSV), for GNU/Linux 2.6.9, dynamically linked (uses 
>> shared libs), for GNU/Linux 2.6.9, not stripped
>>
>>
>> Please advise!
>>
>> Thanks,
>>
>> Daniel

