[torqueusers] sudden pbs_server & pbs_mom segfaults

Dimitris Zilaskos dzila at tassadar.physics.auth.gr
Fri May 22 10:30:56 MDT 2009


And another one just happened:

Program received signal SIGSEGV, Segmentation fault.
0x0806016c in tdelete ()
(gdb) bt
#0  0x0806016c in tdelete ()
Cannot access memory at address 0xffffad48

Dimitris Zilaskos wrote:
> Hi,
> 
> For the last few weeks I have been encountering constant crashes of the 
> pbs_server and pbs_mom processes. The cluster has been running without 
> problems for the last year and I cannot think of any change that could 
> cause this.
> 
> I am trying this:
> 
> PBSDEBUG=yes
> PBSLOGLEVEL=7
> PBSCOREDUMP=yes
> export PBSDEBUG PBSLOGLEVEL PBSCOREDUMP
> gdb /usr/sbin/pbs_server
> 
> [root at ce01 /]# gdb /usr/sbin/pbs_server
> GNU gdb Red Hat Linux (6.3.0.0-1.153.el4_6.1rh)
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you 
> are
> welcome to change it and/or distribute copies of it under certain 
> conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu"...
> (no debugging symbols found)
> Using host libthread_db library "/lib64/tls/libthread_db.so.1".
> 
> (gdb) set height 0
> (gdb) handle SIGPIPE nostop noprint pass
> Signal        Stop      Print   Pass to program Description
> SIGPIPE       No        No      Yes             Broken pipe
> (gdb) run
> Starting program: /usr/sbin/pbs_server
> Reading symbols from shared object read from target memory...(no 
> debugging symbols found)...done.
> Loaded system supplied DSO at 0xffffe000
> (no debugging symbols found)
> (no debugging symbols found)
> (no debugging symbols found)
> (no debugging symbols found)
> (no debugging symbols found)
> (no debugging symbols found)
> pbs_server is up
> PBS_Server: Connection refused (111) in contact_sched, Could not contact 
> Scheduler - port 15004 cannot bind to port 1023 in client_to_svr - 
> connection refused
> Detaching after fork from child process 20192.
> Detaching after fork from child process 31919.
> Detaching after fork from child process 32204.
> Detaching after fork from child process 32205.
> Detaching after fork from child process 2994.
> Detaching after fork from child process 6856.
> Detaching after fork from child process 7612.
> Detaching after fork from child process 8244.
> Detaching after fork from child process 12680.
> Detaching after fork from child process 29571.
> Detaching after fork from child process 31660.
> 
> This is the only thing I captured so far. No core file in 
> /var/spool/pbs/server_priv though.
> 
> Program received signal SIGSEGV, Segmentation fault.
> 0x0806016c in tdelete ()
> (gdb) bt
> #0  0x0806016c in tdelete ()
> #1  0x0805befb in tdelete ()
> #2  0x007efc7b in wait_request () from /usr/lib/libtorque.so.2
> #3  0x0805aafe in tdelete ()
> #4  0x00594de3 in __libc_start_main () from /lib/tls/libc.so.6
> #5  0x0804c7f1 in ?? ()
> 
> [root at ce01 /]# rpm -qa|grep torque
> glite-yaim-torque-utils-4.0.2-2.noarch
> torque-docs-2.3.0-snap.200801151629.2cri.slc4.i386
> lam-devel-7.1.3-2.torque.2.3.0.i386
> torque-client-2.3.0-snap.200801151629.2cri.slc4.i386
> lam-runtime-7.1.3-2.torque.2.3.0.i386
> torque-mom-2.3.0-snap.200801151629.2cri.slc4.i386
> glite-yaim-torque-server-4.0.1-5.noarch
> torque-server-2.3.0-snap.200801151629.2cri.slc4.i386
> openmpi-1.1-2.sl4.torque.2.3.0.i386
> torque-2.3.0-snap.200801151629.2cri.slc4.i386
> torque-devel-2.3.0-snap.200801151629.2cri.slc4.i386
> lam-extras-7.1.3-2.torque.2.3.0.i386
> [root at ce01 /]# cat /etc/redhat-release
> Scientific Linux SL release 4.6 (Beryllium)
> 
> For pbs_mom I have no backtrace so far, only this latest log:
> 
> May 22 16:31:34 wn017 pbs_mom: Success (0) in obit_reply, DIS_reply_read 
> failed, rc=11 sock=10
> May 22 16:31:34 wn017 pbs_mom: Connection refused (111) in 
> scan_for_exiting, cannot bind to port 1023 in client_to_svr - connection 
> refused
> May 22 16:32:06 wn017 last message repeated 32 times
> May 22 16:32:29 wn017 last message repeated 23 times
> May 22 16:32:35 wn017 kernel: pbs_mom[3585]: segfault at 
> 0000000000000004 rip 0000000000417585 rsp 0000007fbffff1b0 error 4
> 
> [root at wn017 log]# rpm -qa|grep torque
> torque-docs-2.3.6-1cri.slc4.x86_64
> torque-client-2.3.6-1cri.slc4.x86_64
> glite-yaim-torque-client-4.0.1-1.noarch
> torque-2.3.6-1cri.slc4.x86_64
> torque-devel-2.3.6-1cri.slc4.x86_64
> torque-mom-2.3.6-1cri.slc4.x86_64
> 
> [root at wn017 log]# cat /etc/redhat-release
> Scientific Linux SL release 4.5 (Beryllium)
> 
> I would appreciate any suggestions on how to tackle this problem. 
> Until now it was limited to 4 specific nodes (two pbs_server nodes 
> + 2 worker nodes), but today it spread to another worker node and 
> I am starting to see a trend here.
> 
> 
> Cheers,
> 
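
For reading the kernel segfault line in the pbs_mom log quoted above, here is
a minimal Python sketch that splits the line into its fields and decodes the
"error" value as an x86 page-fault error code (bit 0: page was present,
bit 1: write access, bit 2: fault came from user mode). The function name
parse_segfault and the dictionary keys are my own choices for illustration,
not part of TORQUE or any kernel tool, and the regular expression only covers
the line format shown in that log.

```python
import re

# Matches the x86_64 kernel message format seen in the quoted log, e.g.
# "pbs_mom[3585]: segfault at <addr> rip <rip> rsp <rsp> error <code>".
# Assumption: all numeric fields except the pid are printed in hex.
SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-fA-F]+) "
    r"rip (?P<rip>[0-9a-fA-F]+) rsp (?P<rsp>[0-9a-fA-F]+) "
    r"error (?P<err>[0-9a-fA-F]+)"
)


def parse_segfault(line):
    """Return the fields of a kernel segfault line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)
    return {
        "program": m.group("prog"),
        "pid": int(m.group("pid")),
        "address": int(m.group("addr"), 16),  # faulting address
        "rip": int(m.group("rip"), 16),       # instruction pointer
        "rsp": int(m.group("rsp"), 16),       # stack pointer
        "error": err,
        # x86 page-fault error code bits:
        "page_present": bool(err & 1),  # False: page was not mapped
        "write": bool(err & 2),         # False: the access was a read
        "user_mode": bool(err & 4),     # True: fault raised in user space
    }


if __name__ == "__main__":
    # The wrapped log line from wn017, joined back into one line.
    line = ("May 22 16:32:35 wn017 kernel: pbs_mom[3585]: segfault at "
            "0000000000000004 rip 0000000000417585 rsp 0000007fbffff1b0 "
            "error 4")
    for key, value in parse_segfault(line).items():
        print(key, value)
```

For the wn017 line this decodes to a user-mode read of an unmapped page at
address 0x4. That is consistent with reading a field at a small offset through
a NULL pointer, although the log line alone does not prove it. The rip value
could be looked up in a pbs_mom binary that has debugging symbols.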


