[torqueusers] Problem with HA and job IDs

Glen Beane glen.beane at gmail.com
Wed Jun 23 10:24:12 MDT 2010


On Wed, Jun 23, 2010 at 12:14 PM, Ken Nielson
<knielson at adaptivecomputing.com> wrote:
> On 06/23/2010 10:13 AM, Glen Beane wrote:
>> On Wed, Jun 23, 2010 at 11:22 AM, Ken Nielson
>> <knielson at adaptivecomputing.com>  wrote:
>>
>>> On 06/23/2010 01:59 AM, Drew Leske wrote:
>>>
>>>> Hi all,
>>>>
>>>> We've got a high-availability scenario happening with two Torque servers
>>>> and for the most part it's pretty good.  The failover is pretty slick
>>>> and so on.  But we have found a problem for which I can't find a
>>>> solution.
>>>>
>>>> We have two servers, named A and B.  They both run "pbs_server --ha" and
>>>> whichever one is active opens up port 15001 for incoming requests.  If I
>>>> kill server A, B picks it up pretty quickly.
>>>>
>>>> Job IDs are all suffixed with A's FQDN, which is consistent with the
>>>> documentation, since A is the first listed server in all configurations.
>>>> Even if B is active and A is dead, new jobs will have IDs such as
>>>> 498243.A.uvic.ca.
>>>>
>>>> It turns out this is a problem for client nodes, even if they are
>>>> properly configured with /var/spool/torque/server_name containing "A,B".
>>>> If a full job ID such as "49824.A.uvic.ca" is specified when B is
>>>> active, the request fails:
>>>>
>>>>     C$ qstat 85112.A
>>>>     Cannot connect to specified server host 'A'
>>>>     qstat: cannot connect to server A (errno=111) Connection refused
>>>>
>>>> If only "85112" is specified, then it works as expected.  It's only when
>>>> the full job ID is used that it fails.  "qstat" to simply list the jobs
>>>> works fine.
>>>>
>>>> I don't see any way to override the job IDs, although I would personally
>>>> prefer it if I could make the job IDs not use the server name.  But we
>>>> have third-party software that cannot be reconfigured to use just the
>>>> numeric part, so the failover breaks this software, and it seems odd to
>>>> me in any case that I can't specify the full job ID if in a failover
>>>> situation.
>>>>
>>>> I didn't see anything in the archives or documentation that deals with
>>>> this.  Any thoughts?
>>>>
>>>> Thanks,
>>>> Drew.
>>>>
>>>>
>>>> Drew Leske, Senior Systems Administrator    | dleske at uvic.ca
>>>> Unix Services Team, University Systems      | 250-472-5055 (office)
>>>> University of Victoria                      | 250-588-4311 (cel)
>>>>
>>>>
>>> Did you try just specifying the job number without the host portion of
>>> the id?
>>>
>> Hi Ken,
>>
>> this was in his email:
>>
>> If only "85112" is specified, then it works as expected.  It's only when
>> the full job ID is used that it fails.
>>
>>
> Thanks for pointing that out. I need to work on my speed reading.

:)


It seems that if you don't specify a host, the client uses the default
server, which is whichever HA server is active.  It looks like we need
some more logic that checks whether a specified host is an "inactive" HA
server and, if so, routes the request to the active HA server...
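
A rough sketch of the routing check described above, written in Python
for illustration only.  It is not TORQUE code and does not reflect how
the client library is implemented; the function names (split_job_id,
same_host, choose_server, port_open) are hypothetical.  It assumes the
HA server list comes from server_name (for example "A,B") and that the
active server is the one accepting connections on port 15001, as
described in the original report.

```python
import socket

HA_PORT = 15001  # port the active pbs_server opens, per the original report


def port_open(host, port=HA_PORT, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def split_job_id(job_id):
    """Split '85112.A.uvic.ca' into ('85112', 'A.uvic.ca'); host is None if absent."""
    seq, sep, host = job_id.partition(".")
    return seq, (host if sep else None)


def same_host(a, b):
    """Loose comparison so the short name 'A' matches the FQDN 'A.uvic.ca'."""
    a, b = a.lower(), b.lower()
    return a == b or a.startswith(b + ".") or b.startswith(a + ".")


def choose_server(job_id, ha_servers, is_active=port_open):
    """Pick the server a request for job_id should be sent to.

    If the job ID names no host, or names any member of the HA group,
    return the first HA server that is_active() reports as up.  A host
    outside the HA group is returned unchanged.
    """
    _, host = split_job_id(job_id)
    in_ha_group = host is None or any(same_host(host, s) for s in ha_servers)
    if not in_ha_group:
        return host  # not one of ours: leave the request alone
    for server in ha_servers:
        if is_active(server):
            return server
    raise ConnectionError("no active HA server among: " + ", ".join(ha_servers))
```

With the server list ["A", "B"] and only B accepting connections,
choose_server("85112.A", ["A", "B"]) would return "B" instead of failing
with "Connection refused" against A.  The is_active parameter exists so
the probe can be swapped out, for example for testing without a network.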

