[torqueusers] Stopping passive pbs_server will stop active pbs_server

Clotho Tsang wytsang at clustertech.com
Fri Apr 26 03:12:07 MDT 2013


Thanks to Derek Gottlieb, who reported the same problem and provided his
solution.

> We've been running torque 4.1.x + Moab 7.2.x in HA mode for a few months.
> At least for our distro (SLES), the included init scripts for both torque
> and Moab were clearly not written with HA in mind.  We had to make a bunch
> of modifications to the init scripts to make them actually usable for
> common scenarios (e.g., function to shut down both servers when needed,
> shut down only the local process, etc).  For example, default functions try
> to kill based on the pid file which doesn't make as much sense if run from
> the passive pbs_server machine using the server.lock file that holds the
> pid of the process on the other server.  Would be nice if they'd release
> init scripts that were HA aware and provided functions to force a failover,
> shut down all servers, etc.
>
> You'll find the Moab init scripts have similar issues.  If you run
> /etc/init.d/moab stop from the passive, I think it issues a 'mschedctl -k'
> that kills the active.
>
> If you'd like, I could share what we've slapped together so far.  It's
> very definitely still a work in progress and will need some customization
> for your environment, but let me know if you'd be interested.
>
> If it helps, I found the following method to determine which machine is
> the current active:
>
> Torque pbs_server:
> qmgr -c 'list server' 2>/dev/null | grep "^Server"
>
> Moab:
> mdiag -S -v|grep "running on"
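
For anyone scripting this, here is a minimal sketch of how the Torque check
above could be turned into an "am I the active node?" test. It assumes the
matching qmgr line has the form "Server <name>" and that comparing short host
names (the part before the first dot) is good enough for your site. The
function names active_server_name and is_same_host are made up for
illustration and are not part of Torque.

```shell
#!/bin/sh
# Read `qmgr -c 'list server'` output on stdin and print the server name.
# Assumption: the first line starting with "Server" looks like "Server <name>".
active_server_name() {
    awk '/^Server/ { print $2; exit }'
}

# Succeed if two host names refer to the same machine, comparing only the
# short name (everything before the first dot).
is_same_host() {
    [ "${1%%.*}" = "${2%%.*}" ]
}

# Usage on a live cluster (needs a reachable pbs_server):
#   active=$(qmgr -c 'list server' 2>/dev/null | active_server_name)
#   if is_same_host "$active" "$(hostname)"; then
#       echo "this node is active"
#   else
#       echo "this node is passive"
#   fi
```

Keeping the parsing separate from the qmgr call makes it possible to check
the logic with canned input before trusting it in an init script.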




On 22 April 2013 16:30, Clotho Tsang <wytsang at clustertech.com> wrote:

> We are setting up Torque 4.1.4 + Moab 7.2.1 in HA mode. Job submission and
> dispatching work fine so far.
>
> However, we found that stopping the passive pbs_server with
> "/etc/init.d/pbs_server stop"
> stops the active pbs_server instead. Here is how to reproduce it:
>
> master1# ps -ef |grep pbs
> root      67328      1  1 15:54 ?        00:00:16 /usr/sbin/pbs_server -d /var/spool/torque --ha -l master1:42559 -l master2:42559
>
> master2# ps -ef |grep pbs
> root      24491      1  0 16:05 ?        00:00:00 /usr/sbin/pbs_server -d /var/spool/torque --ha -l master1:42559 -l master2:42559
>
> Now the active pbs_server is running on master1:
> master1# qstat -a | head -2
>
> master1:
>
> Now I stop pbs_server on master2 (switching off the master2 machine gives
> the same result):
>
> master2# /etc/init.d/pbs_server stop
>
> On master1, pbs_server is shut down (the shutdown request comes from master2):
>
> master1# tail -f /var/spool/torque/server_logs/20130422
> 04/22/2013 16:14:57;0086;PBS_Server.73628;Svr;PBS_Server;Shutdown request from root@master2
> 04/22/2013 16:14:57;0086;PBS_Server.73628;Svr;PBS_Server;Starting to shutdown the server, type is Quick
> 04/22/2013 16:14:57;0002;PBS_Server.67328;Svr;PBS_Server;Server shutdown completed
> 04/22/2013 16:14:57;0002;PBS_Server.67328;Svr;Log;Log closed
>
> I found that the shutdown is triggered by the qterm call in the stop()
> function of /etc/init.d/pbs_server:
>
> stop() {
>     status pbs_server >/dev/null 2>&1
>     if [ $? -ne 0 ]; then
>         echo "pbs_server is not running."
>         exit 0
>     fi
>     echo -n "Shutting down TORQUE Server: "
>     $BIN_PATH/qterm
>     RET=$?
>     if [[ $RET -ne 0 ]]; then
>       killproc pbs_server -TERM
>       RET=$?
>     fi
>
>     rm -f /var/lock/subsys/pbs_server
>     echo
> }
>
> I did not see "qterm" used in earlier Torque versions. Why does qterm shut
> down the pbs_server on the other node rather than the local one?
> Is this pbs_server init script unsuitable for an HA setup?
>
> Thanks.
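
One possible way to make the stop() step HA-aware, sketched under
assumptions: call qterm only when the local node is the active server, and
otherwise signal just the local pbs_server process. This is not the shipped
Torque init script and not Derek's actual modification. The names
choose_stop_action and ha_stop are hypothetical, the /usr/bin fallback for
BIN_PATH is a guess, and pkill stands in for the killproc helper used in the
original script.

```shell
#!/bin/sh
# Decide how to stop pbs_server on this node.
#   $1 = name of the currently active server (may be empty if unknown)
#   $2 = local host name
# Prints "qterm" when this node is the active server, "kill-local" otherwise.
# Assumption: comparing short host names (before the first dot) is sufficient.
choose_stop_action() {
    if [ -n "$1" ] && [ "${1%%.*}" = "${2%%.*}" ]; then
        echo qterm
    else
        echo kill-local
    fi
}

# Hypothetical replacement for the bare qterm call inside stop().
ha_stop() {
    # Ask whichever server answers for its name (same check as quoted above).
    active=$(qmgr -c 'list server' 2>/dev/null | awk '/^Server/ { print $2; exit }')
    case $(choose_stop_action "$active" "$(hostname)") in
        qterm)
            # Local node is active: a normal qterm shutdown is what we want.
            "${BIN_PATH:-/usr/bin}/qterm"
            ;;
        kill-local)
            # Local node is passive (or state unknown): signal only the
            # pbs_server process running on this host.
            pkill -TERM -x pbs_server
            ;;
    esac
}
```

Falling back to the local kill when the active server cannot be determined
is a deliberate choice here: it avoids sending qterm to a server on another
machine by accident.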



-- 
Clotho Tsang
Senior Software Engineer
Cluster Technology Limited
Email: clotho at clustertech.com
Tel: (852) 2655-6129
Fax: (852) 2994-2101
Website: www.clustertech.com