[torqueusers] Problem with HA and job IDs
Ken Nielson
knielson at adaptivecomputing.com
Wed Jun 23 10:14:39 MDT 2010
On 06/23/2010 10:13 AM, Glen Beane wrote:
> On Wed, Jun 23, 2010 at 11:22 AM, Ken Nielson
> <knielson at adaptivecomputing.com> wrote:
>
>> On 06/23/2010 01:59 AM, Drew Leske wrote:
>>
>>> Hi all,
>>>
>>> We've got a high-availability scenario happening with two Torque servers
>>> and for the most part it's pretty good. The failover is pretty slick
>>> and so on. But we have found a problem for which I can't find a
>>> solution.
>>>
>>> We have two servers, named A and B. They both run "pbs_server --ha" and
>>> whichever one is active opens up port 15001 for incoming requests. If I
>>> kill server A, B picks it up pretty quickly.
>>>
>>> Job IDs are all suffixed with A's FQDN, which is consistent with the
>>> documentation, since A is the first listed server in all configurations.
>>> Even if B is active and A is dead, new jobs will have IDs such as
>>> 498243.A.uvic.ca.
>>>
>>> It turns out this is a problem for client nodes, even if they are
>>> properly configured with /var/spool/torque/server_name containing "A,B".
>>> Specifying a full job ID such as "49824.A.uvic.ca" when B is active fails:
>>>
>>> C$ qstat 85112.A
>>> Cannot connect to specified server host 'A'
>>> qstat: cannot connect to server A (errno=111) Connection refused
>>>
>>> If only "85112" is specified, then it works as expected. It's only when
>>> the full job ID is used that it fails. "qstat" to simply list the jobs
>>> works fine.
>>>
>>> I don't see any way to override the job IDs, although I would personally
>>> prefer it if I could make the job IDs not use the server name. But we
>>> have third-party software that cannot be reconfigured to use just the
>>> numeric part, so the failover breaks this software, and it seems odd to
>>> me in any case that I can't specify the full job ID in a failover
>>> situation.
>>>
>>> I didn't see anything in the archives or documentation that deals with
>>> this. Any thoughts?
>>>
>>> Thanks,
>>> Drew.
>>>
>>>
>>> Drew Leske, Senior Systems Administrator | dleske at uvic.ca
>>> Unix Services Team, University Systems | 250-472-5055 (office)
>>> University of Victoria | 250-588-4311 (cel)
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>
>>>
>> Did you try just specifying the job number without the host portion of
>> the id?
>>
> Hi Ken,
>
> this was in his email:
>
> If only "85112" is specified, then it works as expected. It's only when
> the full job ID is used that it fails.
>
>
Thanks for pointing that out. I need to work on my speed reading.
Ken
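
Since the numeric part alone is reported to work during failover, one
possible stopgap for software that insists on passing the full job ID is a
thin wrapper that strips the server suffix before the ID reaches qstat. The
following is only a sketch under that assumption; the function names
strip_server_suffix and rewrite_args are hypothetical and are not part of
Torque.

```python
def strip_server_suffix(job_id: str) -> str:
    """Return the part of a job ID before the first dot.

    "85112.A.uvic.ca" becomes "85112"; an ID with no dot is returned as-is.
    """
    return job_id.split(".", 1)[0]


def rewrite_args(argv):
    """Strip the server suffix from arguments that look like job IDs.

    Assumption: a job ID argument starts with a digit. Anything else
    (for example an option such as "-f") is passed through unchanged.
    """
    return [strip_server_suffix(a) if a[:1].isdigit() else a for a in argv]
```

A wrapper script placed ahead of the real qstat in the third-party
software's PATH could then run something like
subprocess.call(["qstat"] + rewrite_args(sys.argv[1:])). This works around
the symptom only; it does not change how the client resolves the server
name embedded in the job ID.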