Bug 35 - Pbs_server writes the wrong PID number to $pbs_home/server_priv/server.lock
Product: TORQUE
Version: 2.4.x
Platform: All Linux
Importance: P3 critical
Assigned To: David Beer
Reported: 2009-11-27 14:50 MST by Denis Charland
Modified: 2009-12-04 16:40 MST (History)

Fix (4.27 KB, patch)
2009-12-04 15:59 MST, David Beer

Description Denis Charland 2009-11-27 14:50:38 MST
When pbs_server starts, it writes the wrong PID to the
$pbs_home/server_priv/server.lock file. The PID written to this file is
the pbs_server PID minus 1. This prevents the /etc/init.d/pbs script from
properly stopping the server; only the scheduler is stopped.

[root@fn1 ~]# ps -ef | grep pbs_server
root     18669     1  0 12:02 ?        00:00:00 /usr/torque/sbin/pbs_server
root     19016   744  0 16:47 pts/1    00:00:00 grep pbs_server
[root@fn1 ~]# cat /var/spool/torque/server_priv/server.lock
[root@fn1 ~]#
Comment 1 Glen 2009-11-30 20:38:50 MST
It appears that pbs_server writes out this lock file before it forks itself to
put itself into the background, and the bug appears at least as far back as
the later 2.3.x versions.  Some of this code was modified for "high availability"
mode, where multiple pbs_servers could be monitoring the same lock file.

I am going to propose a solution to the TORQUE developers mailing list for
comments, and we should get this fixed in the 2.3 and 2.4 branches (as well as
subversion trunk).
Comment 2 Glen 2009-11-30 21:05:06 MST
Actually, I take my comment back.  The bug is not in the 2.3.x branch; it
appeared in 2.4.x.
Comment 3 Glen 2009-11-30 21:34:53 MST
As far as I can tell, at least when not running in HA mode, the code looks like
it should do the right thing: fork, create a new session, and write the session
ID (which should be the same as the PID of the newly forked process) to the
lock file.

I'll probably add some debugging output to my local build to see if I can track
this down.
Comment 4 David Beer 2009-12-01 13:46:52 MST
I looked into this and I have fixed it.  In normal mode, the problem is that
the PID for the server wasn't updated after the last fork, hence the
off-by-one error.
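
As an illustration of this failure mode (a sketch only, not the TORQUE code or the patch), the function below caches getpid() before a fork and then checks the cached value from inside the child. The function name is hypothetical. The cached value is the parent's PID, and since Linux commonly hands out PIDs sequentially, it will often be exactly one less than the child's PID, which matches the symptom reported here.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demonstration: returns 1 if a PID cached before fork() no
 * longer matches getpid() in the child (it never does), 0 if it somehow
 * still matches, and -1 on error. */
static int cached_pid_is_stale_in_child(void)
{
    pid_t cached = getpid();     /* captured before the fork */

    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        /* In the child, `cached` still holds the parent's PID.
         * Exit status 0 means "stale", 1 means "still matches". */
        _exit(cached != getpid() ? 0 : 1);
    }

    int status = 0;
    if (waitpid(child, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 1 : 0;
}
```

Refreshing the stored PID with getpid() after the final fork, before writing the lock file, avoids the stale value.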

For high availability mode (with --enable-high-availability configured) the
problem was that it didn't write anything to the lock file at all.

Both of these problems have been corrected in a patch I created that is being
reviewed for check-in.


David Beer
Comment 5 Denis Charland 2009-12-03 19:15:28 MST
David, could you post the patch here after it has been reviewed for check-in?
Comment 6 David Beer 2009-12-04 13:32:04 MST
Sure, once we clear the patch I will post it here.
Comment 7 David Beer 2009-12-04 15:59:58 MST
Created an attachment (id=21)

This is the patch to fix this bug.