[torqueusers] sudden pbs_server & pbs_mom segfaults

Dimitris Zilaskos dzila at tassadar.physics.auth.gr
Fri May 22 09:15:14 MDT 2009


Hi,

For the last few weeks I have been encountering constant crashes of the 
pbs_server and pbs_mom processes. The cluster has been running without 
problems for the last year and I cannot think of any change that could 
have caused this.

I am trying to debug it by running pbs_server under gdb like this:

PBSDEBUG=yes
PBSLOGLEVEL=7
PBSCOREDUMP=yes
export PBSDEBUG PBSLOGLEVEL PBSCOREDUMP
gdb /usr/sbin/pbs_server

[root@ce01 /]# gdb /usr/sbin/pbs_server
GNU gdb Red Hat Linux (6.3.0.0-1.153.el4_6.1rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain 
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
(no debugging symbols found)
Using host libthread_db library "/lib64/tls/libthread_db.so.1".

(gdb) set height 0
(gdb) handle SIGPIPE nostop noprint pass
Signal        Stop      Print   Pass to program Description
SIGPIPE       No        No      Yes             Broken pipe
(gdb) run
Starting program: /usr/sbin/pbs_server
Reading symbols from shared object read from target memory...(no 
debugging symbols found)...done.
Loaded system supplied DSO at 0xffffe000
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
pbs_server is up
PBS_Server: Connection refused (111) in contact_sched, Could not contact 
Scheduler - port 15004 cannot bind to port 1023 in client_to_svr - 
connection refused
Detaching after fork from child process 20192.
Detaching after fork from child process 31919.
Detaching after fork from child process 32204.
Detaching after fork from child process 32205.
Detaching after fork from child process 2994.
Detaching after fork from child process 6856.
Detaching after fork from child process 7612.
Detaching after fork from child process 8244.
Detaching after fork from child process 12680.
Detaching after fork from child process 29571.
Detaching after fork from child process 31660.

The backtrace that follows is the only thing I have captured so far. No 
core file was written to /var/spool/pbs/server_priv, though.

Program received signal SIGSEGV, Segmentation fault.
0x0806016c in tdelete ()
(gdb) bt
#0  0x0806016c in tdelete ()
#1  0x0805befb in tdelete ()
#2  0x007efc7b in wait_request () from /usr/lib/libtorque.so.2
#3  0x0805aafe in tdelete ()
#4  0x00594de3 in __libc_start_main () from /lib/tls/libc.so.6
#5  0x0804c7f1 in ?? ()
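
Regarding the missing core file: a minimal sketch for checking the two settings that commonly decide whether a core file is written on Linux. The function name core_dump_status is made up for illustration. It assumes the check is run in the same shell (or init script) that starts the daemon, since the limit is inherited from there.

```shell
# Hypothetical helper: report the settings that commonly control core files.
core_dump_status() {
    limit=$(ulimit -c)    # soft limit on core file size for this shell
    if [ "$limit" = "0" ]; then
        echo "core dumps disabled (ulimit -c is 0)"
    else
        echo "core dumps allowed (ulimit -c is $limit)"
    fi
    # On Linux, core_pattern sets the name and location of the core file.
    if [ -r /proc/sys/kernel/core_pattern ]; then
        echo "core_pattern: $(cat /proc/sys/kernel/core_pattern)"
    fi
}

core_dump_status
```

If the limit turns out to be 0, running "ulimit -c unlimited" before starting the daemon is the usual remedy. Note also that while the process is stopped under gdb at the SIGSEGV, no core file is normally written. gdb's generate-core-file command, if this gdb version has it, can save one from the stopped process.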

[root@ce01 /]# rpm -qa|grep torque
glite-yaim-torque-utils-4.0.2-2.noarch
torque-docs-2.3.0-snap.200801151629.2cri.slc4.i386
lam-devel-7.1.3-2.torque.2.3.0.i386
torque-client-2.3.0-snap.200801151629.2cri.slc4.i386
lam-runtime-7.1.3-2.torque.2.3.0.i386
torque-mom-2.3.0-snap.200801151629.2cri.slc4.i386
glite-yaim-torque-server-4.0.1-5.noarch
torque-server-2.3.0-snap.200801151629.2cri.slc4.i386
openmpi-1.1-2.sl4.torque.2.3.0.i386
torque-2.3.0-snap.200801151629.2cri.slc4.i386
torque-devel-2.3.0-snap.200801151629.2cri.slc4.i386
lam-extras-7.1.3-2.torque.2.3.0.i386
[root@ce01 /]# cat /etc/redhat-release
Scientific Linux SL release 4.6 (Beryllium)

For pbs_mom I have no backtrace so far, only this latest log:

May 22 16:31:34 wn017 pbs_mom: Success (0) in obit_reply, DIS_reply_read 
failed, rc=11 sock=10
May 22 16:31:34 wn017 pbs_mom: Connection refused (111) in 
scan_for_exiting, cannot bind to port 1023 in client_to_svr - connection 
refused
May 22 16:32:06 wn017 last message repeated 32 times
May 22 16:32:29 wn017 last message repeated 23 times
May 22 16:32:35 wn017 kernel: pbs_mom[3585]: segfault at 
0000000000000004 rip 0000000000417585 rsp 0000007fbffff1b0 error 4
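
The kernel line above can be read a little further. This is a small sketch that decodes the low three bits of the x86 page-fault error code printed after "error". The function name decode_segv_error is made up for illustration, and only bits 0 to 2 are interpreted.

```shell
# Hypothetical helper: decode bits 0-2 of an x86 page-fault error code.
decode_segv_error() {
    code=$1
    # bit 0: 0 = page not present, 1 = protection violation
    if [ $(( $code & 1 )) -ne 0 ]; then cause="protection violation"; else cause="page not present"; fi
    # bit 1: 0 = read access, 1 = write access
    if [ $(( $code & 2 )) -ne 0 ]; then access="write"; else access="read"; fi
    # bit 2: 0 = fault in kernel mode, 1 = fault in user mode
    if [ $(( $code & 4 )) -ne 0 ]; then mode="user-mode"; else mode="kernel-mode"; fi
    echo "$mode $access, $cause"
}

decode_segv_error 4    # prints: user-mode read, page not present
```

A read fault at address 0x4 is typically what a NULL pointer dereference at a small structure offset looks like, though the log alone does not prove that. If a pbs_mom binary with debug symbols is available, "addr2line -e /usr/sbin/pbs_mom 0x417585" may map the rip value to a source line. The binary path is an assumption based on where pbs_server is installed.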

[root@wn017 log]# rpm -qa|grep torque
torque-docs-2.3.6-1cri.slc4.x86_64
torque-client-2.3.6-1cri.slc4.x86_64
glite-yaim-torque-client-4.0.1-1.noarch
torque-2.3.6-1cri.slc4.x86_64
torque-devel-2.3.6-1cri.slc4.x86_64
torque-mom-2.3.6-1cri.slc4.x86_64

[root@wn017 log]# cat /etc/redhat-release
Scientific Linux SL release 4.5 (Beryllium)

I would appreciate any suggestions on how to tackle this problem. Until 
now it was confined to four specific nodes (two pbs_server nodes and two 
worker nodes), but today it spread to another worker node, and I am 
starting to see a trend here.


Cheers,

-- 
=============================================================================
Dimitris Zilaskos
GridAUTH Operations Centre @ Aristotle University of Thessaloniki , Greece
Tel: +302310998988 Fax: +302310994309
http://www.grid.auth.gr
=============================================================================




