[torqueusers] Problem with HA and job IDs

Drew Leske dleske at uvic.ca
Wed Jun 23 13:03:52 MDT 2010


On Wed, Jun 23, 2010 at 09:24:12AM -0700, Glen Beane wrote:
> On Wed, Jun 23, 2010 at 12:14 PM, Ken Nielson
> <knielson at adaptivecomputing.com> wrote:
> > On 06/23/2010 10:13 AM, Glen Beane wrote:
> >> On Wed, Jun 23, 2010 at 11:22 AM, Ken Nielson
> >> <knielson at adaptivecomputing.com> wrote:
> >>
> >>> On 06/23/2010 01:59 AM, Drew Leske wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> We've got a high-availability scenario happening with two Torque servers
> >>>> and for the most part it's pretty good.  The failover is pretty slick
> >>>> and so on.  But we have found a problem for which I can't find a
> >>>> solution.
> >>>>
> >>>> We have two servers, named A and B.  They both run "pbs_server --ha" and
> >>>> whichever one is active opens up port 15001 for incoming requests.  If I
> >>>> kill server A, B picks it up pretty quickly.
> >>>>
> >>>> Job IDs are all suffixed with A's FQDN, which is consistent with the
> >>>> documentation, since A is the first listed server in all configurations.
> >>>> Even if B is active and A is dead, new jobs will have IDs such as
> >>>> 498243.A.uvic.ca.
> >>>>
> >>>> It turns out this is a problem for client nodes, even if they are
> >>>> properly configured with /var/spool/torque/server_name containing "A,B".
> >>>> If specifying a job ID "49824.A.uvic.ca" when B is active, this fails:
> >>>>
> >>>>     C$ qstat 85112.A
> >>>>     Cannot connect to specified server host 'A'
> >>>>     qstat: cannot connect to server A (errno=111) Connection refused
> >>>>
> >>>> If only "85112" is specified, then it works as expected.  It's only when
> >>>> the full job ID is used that it fails.  "qstat" to simply list the jobs
> >>>> works fine.
> >>>>
> >>>> I don't see any way to override the job IDs, although I would personally
> >>>> prefer it if I could make the job IDs not use the server name.  But we
> >>>> have third-party software that cannot be reconfigured to use just the
> >>>> numeric part, so the failover breaks this software, and it seems odd to
> >>>> me in any case that I can't specify the full job ID if in a failover
> >>>> situation.
> >>>>
> >>>> I didn't see anything in the archives or documentation that deals with
> >>>> this.  Any thoughts?
> >>>>
> >>>> Thanks,
> >>>> Drew.
> >>>>
> >>>>
> >>>> Drew Leske, Senior Systems Administrator    | dleske at uvic.ca
> >>>> Unix Services Team, University Systems      | 250-472-5055 (office)
> >>>> University of Victoria                      | 250-588-4311 (cel)
> >>>>
> >>>>
> >>> Did you try just specifying the job number without the host portion of
> >>> the id?
> >>>
> >> Hi Ken,
> >>
> >> this was in his email:
> >>
> >> If only "85112" is specified, then it works as expected.  It's only when
> >> the full job ID is used that it fails.
> >>
> >>
> > Thanks for pointing that out. I need to work on my speed reading.
> 
> :)
> 
> 
> It seems like if you don't specify a host then the client uses the
> default server, which is whatever HA server is active.   Looks like we
> need some more logic that checks to see if a host specified is an
> "inactive" HA server and route the request to the active HA server...

Dang.  I saw four responses to my question and thought "Great!  Four
solutions!"  Heh oh well.

I don't see a workaround without code changes--if somebody else does
please let me know.  I was having to check that failover hadn't taken place
so that certain client nodes continued to function.

I have implemented a patch that appears to be working.  In the logic for
selecting the fallback server, the unpatched logic is:

  if the server can't be contacted
    if no server was specified
      if a fallback server exists
        set server = fallback server
        goto start

The patched logic is:

  if the server can't be contacted
    if no server was specified or the default server was specified
      if a fallback server exists
        ...
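
As a rough illustration only, the patched condition can be written as a
small standalone C function.  The name should_try_fallback and its
parameters are made up for this sketch and do not exist in Torque; the
real check sits inline in cnt2server.c.  Unlike the one-line patch, the
sketch also guards against a NULL default server name, on the assumption
that the lookup could fail.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Torque): decide whether a client that
 * got "connection refused" should retry against the fallback server.
 *
 *   requested      - server name from the job ID or command line, may be NULL
 *   default_server - the default server name, may be NULL
 *   have_fallback  - true if a fallback server is configured
 */
bool should_try_fallback(const char *requested,
                         const char *default_server,
                         bool have_fallback)
{
    if (!have_fallback)
        return false;

    /* Unpatched condition: no server was specified at all. */
    if (requested == NULL || requested[0] == '\0')
        return true;

    /* Patched condition: the default server was named explicitly. */
    if (default_server != NULL && strcmp(requested, default_server) == 0)
        return true;

    /* Some other server was asked for by name: do not redirect. */
    return false;
}
```

With servers A and B configured, a request for "85112.A" would have
requested = "A" and default_server = "A", so the client would go on to
try B.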

This may not be completely correct--I'm supposed to leave on vacation
next week and so don't have time to really put in the work.  It concerns
me that the code I added for "if the default server was specified" is a
simple string comparison, and if Torque supports more than one fallback
server, then this won't work with that.
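
One way around the single-comparison limitation, sketched here under the
assumption that the configured servers are available as one
comma-separated string in the same form as server_name ("A,B"), is to
compare the requested server against every entry.  server_in_list is a
hypothetical helper, not a Torque function, and it does exact matching
only, so a short name such as "A" would not match an entry written as
"A.uvic.ca".

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Torque): return true if `name` exactly
 * matches one entry of the comma-separated server list `list`, e.g. "A,B".
 */
bool server_in_list(const char *name, const char *list)
{
    size_t name_len;
    const char *p;

    if (name == NULL || name[0] == '\0' || list == NULL)
        return false;

    name_len = strlen(name);
    p = list;

    while (*p != '\0') {
        /* Length of the current entry, up to the next comma or the end. */
        size_t entry_len = strcspn(p, ",");

        if (entry_len == name_len && strncmp(p, name, name_len) == 0)
            return true;

        p += entry_len;
        if (*p == ',')
            p++;    /* step over the separator */
    }

    return false;
}
```

The string comparison in the patch could then be replaced by a call to
such a helper, so that naming any configured server would allow the
fallback to be tried.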

Anyway, here is the patch.  I have verified it works against 2.4.6
(ignore the host it was originally written on).  We are currently using
this on one client host where another product relied on being able to
use full job IDs.

Cheers,
Drew.

--- torque-2.4.8-orig/src/lib/Libcmds/cnt2server.c  2009-11-30 10:20:44.000000000 -0800
+++ torque-2.4.8/src/lib/Libcmds/cnt2server.c 2010-06-23 11:16:33.000000000 -0700
@@ -230,7 +230,7 @@
         {
         if (errno == ECONNREFUSED)
           {
-          if ((Server == NULL) || (Server[0] == '\0'))
+          if ((Server == NULL) || (Server[0] == '\0') || (strcmp(Server,pbs_default()) == 0))
             {
             char *fbserver;

Drew Leske, Senior Systems Administrator    | dleske at uvic.ca
Unix Services Team, University Systems      | 250-472-5055 (office)
University of Victoria                      | 250-588-4311 (cel)

