[torqueusers] sudden pbs_server & pbs_mom segfaults

Tom Rudwick tomr at intrinsity.com
Tue May 26 16:44:24 MDT 2009


I would check for the possibility of corrupted files in the server_priv
directory. If you can clear out and recreate the serverdb (configuration)
and/or job files on the affected nodes, you could confirm or rule out 
that cause.
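
If you do go that route, a rough sketch of the steps on the affected server
(untested here, and assuming the default /var/spool/pbs spool directory that
appears in your logs; note that setting the jobs directory aside will lose
anything still queued):

  # stop the server first
  qterm -t quick                        # or your usual pbs_server init script
  cd /var/spool/pbs
  mv server_priv/serverdb server_priv/serverdb.suspect
  mv server_priv/jobs     server_priv/jobs.suspect
  mkdir server_priv/jobs
  # recreate an empty serverdb, then re-apply the qmgr configuration
  pbs_server -t create

If you have a saved "qmgr -c 'print server'" dump, re-applying the
configuration afterwards is straightforward.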

Good luck,
Tom

Dimitris Zilaskos wrote:
> And another one just happened:
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x0806016c in tdelete ()
> (gdb) bt
> #0  0x0806016c in tdelete ()
> Cannot access memory at address 0xffffad48
>
> Dimitris Zilaskos wrote:
>   
>> Hi,
>>
>> For the last few weeks I have been encountering constant crashes of the 
>> pbs_server and pbs_mom processes. The cluster has been running without 
>> problems for the last year and I cannot think of any change that could 
>> cause this.
>>
>> I am trying this:
>>
>> PBSDEBUG=yes
>> PBSLOGLEVEL=7
>> PBSCOREDUMP=yes
>> export PBSDEBUG PBSLOGLEVEL PBSCOREDUMP
>> gdb /usr/sbin/pbs_server
>>
>> [root at ce01 /]# gdb /usr/sbin/pbs_server
>> GNU gdb Red Hat Linux (6.3.0.0-1.153.el4_6.1rh)
>> Copyright 2004 Free Software Foundation, Inc.
>> GDB is free software, covered by the GNU General Public License, and you 
>> are
>> welcome to change it and/or distribute copies of it under certain 
>> conditions.
>> Type "show copying" to see the conditions.
>> There is absolutely no warranty for GDB.  Type "show warranty" for details.
>> This GDB was configured as "x86_64-redhat-linux-gnu"...
>> (no debugging symbols found)
>> Using host libthread_db library "/lib64/tls/libthread_db.so.1".
>>
>> (gdb) set height 0
>> (gdb) handle SIGPIPE nostop noprint pass
>> Signal        Stop      Print   Pass to program Description
>> SIGPIPE       No        No      Yes             Broken pipe
>> (gdb) run
>> Starting program: /usr/sbin/pbs_server
>> Reading symbols from shared object read from target memory...(no 
>> debugging symbols found)...done.
>> Loaded system supplied DSO at 0xffffe000
>> (no debugging symbols found)
>> (no debugging symbols found)
>> (no debugging symbols found)
>> (no debugging symbols found)
>> (no debugging symbols found)
>> (no debugging symbols found)
>> pbs_server is up
>> PBS_Server: Connection refused (111) in contact_sched, Could not contact 
>> Scheduler - port 15004 cannot bind to port 1023 in client_to_svr - 
>> connection refused
>> Detaching after fork from child process 20192.
>> Detaching after fork from child process 31919.
>> Detaching after fork from child process 32204.
>> Detaching after fork from child process 32205.
>> Detaching after fork from child process 2994.
>> Detaching after fork from child process 6856.
>> Detaching after fork from child process 7612.
>> Detaching after fork from child process 8244.
>> Detaching after fork from child process 12680.
>> Detaching after fork from child process 29571.
>> Detaching after fork from child process 31660.
>>
>> This is the only thing I captured so far. No core file in 
>> /var/spool/pbs/server_priv though.
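>>
>> (One thing that might explain the missing core, as a sketch: the daemon
>> only inherits the core-size limit of the shell that started it, so it may
>> be worth raising that limit before launching and checking where the kernel
>> actually writes cores, rather than looking only in server_priv:
>>
>>   ulimit -c unlimited
>>   cat /proc/sys/kernel/core_pattern   # shows where core files will land
>>
>> and then start pbs_server from that same shell.)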
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> 0x0806016c in tdelete ()
>> (gdb) bt
>> #0  0x0806016c in tdelete ()
>> #1  0x0805befb in tdelete ()
>> #2  0x007efc7b in wait_request () from /usr/lib/libtorque.so.2
>> #3  0x0805aafe in tdelete ()
>> #4  0x00594de3 in __libc_start_main () from /lib/tls/libc.so.6
>> #5  0x0804c7f1 in ?? ()
>>
>> [root at ce01 /]# rpm -qa|grep torque
>> glite-yaim-torque-utils-4.0.2-2.noarch
>> torque-docs-2.3.0-snap.200801151629.2cri.slc4.i386
>> lam-devel-7.1.3-2.torque.2.3.0.i386
>> torque-client-2.3.0-snap.200801151629.2cri.slc4.i386
>> lam-runtime-7.1.3-2.torque.2.3.0.i386
>> torque-mom-2.3.0-snap.200801151629.2cri.slc4.i386
>> glite-yaim-torque-server-4.0.1-5.noarch
>> torque-server-2.3.0-snap.200801151629.2cri.slc4.i386
>> openmpi-1.1-2.sl4.torque.2.3.0.i386
>> torque-2.3.0-snap.200801151629.2cri.slc4.i386
>> torque-devel-2.3.0-snap.200801151629.2cri.slc4.i386
>> lam-extras-7.1.3-2.torque.2.3.0.i386
>> [root at ce01 /]# cat /etc/redhat-release
>> Scientific Linux SL release 4.6 (Beryllium)
>>
>> For pbs_mom no backtrace so far, I have this latest log:
>>
>> May 22 16:31:34 wn017 pbs_mom: Success (0) in obit_reply, DIS_reply_read 
>> failed, rc=11 sock=10
>> May 22 16:31:34 wn017 pbs_mom: Connection refused (111) in 
>> scan_for_exiting, cannot bind to port 1023 in client_to_svr - connection 
>> refused
>> May 22 16:32:06 wn017 last message repeated 32 times
>> May 22 16:32:29 wn017 last message repeated 23 times
>> May 22 16:32:35 wn017 kernel: pbs_mom[3585]: segfault at 
>> 0000000000000004 rip 0000000000417585 rsp 0000007fbffff1b0 error 4
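>>
>> (The kernel line above at least gives a code address; as a sketch, if
>> /usr/sbin/pbs_mom is not stripped and not a position-independent
>> executable, gdb might map that address to a function name:
>>
>>   gdb /usr/sbin/pbs_mom
>>   (gdb) info symbol 0x417585
>> )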
>>
>> [root at wn017 log]# rpm -qa|grep torque
>> torque-docs-2.3.6-1cri.slc4.x86_64
>> torque-client-2.3.6-1cri.slc4.x86_64
>> glite-yaim-torque-client-4.0.1-1.noarch
>> torque-2.3.6-1cri.slc4.x86_64
>> torque-devel-2.3.6-1cri.slc4.x86_64
>> torque-mom-2.3.6-1cri.slc4.x86_64
>>
>> [root at wn017 log]# cat /etc/redhat-release
>> Scientific Linux SL release 4.5 (Beryllium)
>>
>> I would appreciate any suggestions on how to tackle this problem. So far 
>> the problem has been specific to 4 nodes (two pbs_server nodes + 2 worker 
>> nodes), but today it spread to another worker node and I am starting to 
>> see a trend here.
>>
>>
>> Cheers,
>>
>>     
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>   
