Bug 194 - pbs_server crashes when removing and adding nodes.
Status: NEW
Product: TORQUE
Version: 3.0.x
Platform: PC Linux
Importance: P5 critical
Assigned To: David Beer
Reported: 2012-05-11 05:40 MDT by Roy Dragseth
Modified: 2012-05-11 06:01 MDT (History)

Description Roy Dragseth 2012-05-11 05:40:45 MDT
This is for v3.0.5.

pbs_server segfaults if nodes are deleted and re-created in quick succession:

[root@hpc1 ~]# cat /tmp/removeandaddnodes.txt 
qmgr -c "delete node compute-0-0" 2> /dev/null
qmgr -c "create node compute-0-0 np=2,ntype=cluster" 2> /dev/null
qmgr -c "delete node compute-0-1" 2> /dev/null
qmgr -c "create node compute-0-1 np=2,ntype=cluster" 2> /dev/null
qmgr -c "delete node compute-0-2" 2> /dev/null
qmgr -c "create node compute-0-2 np=2,ntype=cluster" 2> /dev/null
[root@hpc1 ~]# sh -x /tmp/removeandaddnodes.txt
+ qmgr -c 'delete node compute-0-0'
+ qmgr -c 'create node compute-0-0 np=2,ntype=cluster'
+ qmgr -c 'delete node compute-0-1'
+ qmgr -c 'create node compute-0-1 np=2,ntype=cluster'
+ qmgr -c 'delete node compute-0-2'
+ qmgr -c 'create node compute-0-2 np=2,ntype=cluster'
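
To repeat the test with more nodes, the command list can be generated instead of written out by hand. This is only a sketch: `gen_node_cmds` is a hypothetical helper name, and it assumes the same `compute-0-N` node names and `np=2,ntype=cluster` attributes as the script above. It prints the commands without running them, and it leaves out the `2> /dev/null` redirection so that any qmgr errors stay visible.

```shell
#!/bin/sh
# Print delete/create command pairs for nodes compute-0-0 .. compute-0-(N-1).
# Node names and attributes are copied from the report; gen_node_cmds is a
# hypothetical helper name, not part of TORQUE.
gen_node_cmds() {
    count=$1
    i=0
    while [ "$i" -lt "$count" ]; do
        printf 'qmgr -c "delete node compute-0-%d"\n' "$i"
        printf 'qmgr -c "create node compute-0-%d np=2,ntype=cluster"\n' "$i"
        i=$((i + 1))
    done
}

# Dry run: only prints the commands. To run them against a live pbs_server,
# pipe the output to a shell instead:
#   gen_node_cmds 3 | sh -x
gen_node_cmds 3
```

With an argument of 3 this prints the same six qmgr commands as /tmp/removeandaddnodes.txt, minus the stderr redirection.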

Running pbs_server under gdb gives the following backtrace:

[root@hpc1 ~]# gdb /opt/torque/sbin/pbs_server 
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
Reading symbols from /opt/torque/sbin/pbs_server...(no debugging symbols
(gdb) set args -D
(gdb) run
Starting program: /opt/torque/sbin/pbs_server -D
pbs_server is up

Program received signal SIGSEGV, Segmentation fault.
0x0000000000410470 in update_nodes_file ()
Missing separate debuginfos, use: debuginfo-install torque-3.0.5-1.x86_64
(gdb) bt
#0  0x0000000000410470 in update_nodes_file ()
#1  0x0000000000425534 in mgr_node_create ()
#2  0x0000000000427485 in req_manager ()
#3  0x000000000041de9a in process_request ()
#4  0x00002aaaaaacfb39 in wait_request (waittime=<value optimized out>,
SState=0x72f438) at ../Libnet/net_server.c:507
#5  0x000000000041c03b in main_loop ()
#6  0x000000000041cd55 in main ()

This works fine with torque 2.4.11 and 4.0.1.
Comment 1 Roy Dragseth 2012-05-11 06:01:09 MDT
I forgot to mention that this error does not occur if the node list is empty.
It only happens when you delete and re-create nodes that already exist.