[torqueusers] pbs_server crashes when deleting and adding nodes.

Roy Dragseth roy.dragseth at cc.uit.no
Tue Apr 24 06:34:00 MDT 2012


System specs:  CentOS 6.2, torque v3.0.4

pbs_server segfaults with the backtrace listed below when deleting and
re-adding nodes with qmgr:

[root@hpc1 ~]# qmgr < /tmp/addnodes.sh
Max open servers: 10239
delete node compute-0-0
qmgr obj=compute-0-0 svr=default: Unknown node 
create node compute-0-0 np=2,ntype=cluster
delete node compute-0-1
create node compute-0-1 np=2,ntype=cluster
qmgr obj=compute-0-1 svr=default: End of File
delete node compute-0-2

[root@hpc1 ~]# cat /tmp/addnodes.sh
delete node compute-0-0
create node compute-0-0 np=2,ntype=cluster
delete node compute-0-1
create node compute-0-1 np=2,ntype=cluster
delete node compute-0-2
create node compute-0-2 np=2,ntype=cluster

Backtrace

[root@hpc1 torque-3.0.4]# gdb /opt/torque/sbin/pbs_server
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/torque/sbin/pbs_server...done.
(gdb) run
Starting program: /opt/torque/sbin/pbs_server 
pbs_server is up

Program received signal SIGSEGV, Segmentation fault.
0x0000000000412a8c in update_nodes_file () at node_func.c:1314
1314        if (np->nd_state & INUSE_DELETED)
Missing separate debuginfos, use: debuginfo-install torque-3.0.4-1.x86_64
(gdb) bt
#0  0x0000000000412a8c in update_nodes_file () at node_func.c:1314
#1  0x0000000000430328 in mgr_node_create (preq=0x11c1e30) at req_manager.c:2209
#2  0x0000000000430420 in req_manager (preq=0x11c1e30) at req_manager.c:2271
#3  0x0000000000424010 in dispatch_request (sfds=14, request=0x11c1e30) at process_request.c:862
#4  0x0000000000423e65 in process_request (sfds=14) at process_request.c:734
#5  0x00002aaaaaad6164 in wait_request (waittime=32, SState=0x7409d8) at ../Libnet/net_server.c:507
#6  0x0000000000420f66 in main_loop () at pbsd_main.c:1238
#7  0x0000000000421c8d in main (argc=1, argv=0x7fffffffe2d8) at pbsd_main.c:1793
(gdb) list
1309
1310      for (i = 0;i < svr_totnodes;++i)
1311        {
1312        np = pbsndmast[i];
1313
1314        if (np->nd_state & INUSE_DELETED)
1315          continue;
1316
1317        /* ... write its name, and if time-shared, append :ts */
1318
(gdb) cont
Continuing.

Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.
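
A minimal sketch of the loop pattern shown in the listing above, to make the suspected failure concrete. It is not the Torque source: struct pbsnode here is a cut-down stand-in, the INUSE_DELETED value is a placeholder, and count_writable_nodes is a hypothetical helper. It assumes, without confirmation from the backtrace, that the delete/create sequence leaves a NULL slot in pbsndmast while svr_totnodes still counts it.

```c
#include <stddef.h>

/* Stand-in definitions for illustration only; the real ones live in the
 * Torque headers and the real value of INUSE_DELETED may differ. */
#define INUSE_DELETED 0x04

struct pbsnode {
    const char *nd_name;
    int         nd_state;
};

/* Hypothetical helper: walks the node array the same way the loop at
 * node_func.c:1310 does, but checks for a NULL slot before reading
 * nd_state. Returns the number of nodes that would be written out. */
static int count_writable_nodes(struct pbsnode **pbsndmast, int svr_totnodes)
{
    int i;
    int written = 0;

    for (i = 0; i < svr_totnodes; ++i) {
        struct pbsnode *np = pbsndmast[i];

        if (np == NULL)                     /* guard missing in the listing */
            continue;

        if (np->nd_state & INUSE_DELETED)   /* same test as line 1314 */
            continue;

        written++;
    }

    return written;
}
```

If the slot holds a stale pointer to freed memory rather than NULL, a guard like this would not help, and the fix would have to be in how nodes are removed from the array.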

In Rocks, this means that pbs_server crashes every time one runs
"rocks sync config", which refreshes the cluster config files.


Any clues?

r.

-- 

  The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
	      phone:+47 77 64 41 07, fax:+47 77 64 41 00
        Roy Dragseth, Team Leader, High Performance Computing
	 Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no


