[torqueusers] Problem with HA and job IDs

Ken Nielson knielson at adaptivecomputing.com
Wed Jun 23 09:22:38 MDT 2010


On 06/23/2010 01:59 AM, Drew Leske wrote:
> Hi all,
>
> We've got a high-availability scenario happening with two Torque servers
> and for the most part it's pretty good.  The failover is pretty slick
> and so on.  But we have found a problem for which I can't find a
> solution.
>
> We have two servers, named A and B.  They both run "pbs_server --ha" and
> whichever one is active opens up port 15001 for incoming requests.  If I
> kill server A, B picks it up pretty quickly.
>
> Job IDs are all suffixed with A's FQDN, which is consistent with the
> documentation, since A is the first listed server in all configurations.
> Even if B is active and A is dead, new jobs will have IDs such as
> 498243.A.uvic.ca.
>
> It turns out this is a problem for client nodes, even if they are
> properly configured with /var/spool/torque/server_name containing "A,B".
> Specifying a full job ID such as "498243.A.uvic.ca" while B is active fails:
>
>    C$ qstat 85112.A
>    Cannot connect to specified server host 'A'
>    qstat: cannot connect to server A (errno=111) Connection refused
>
> If only "85112" is specified, then it works as expected.  It's only when
> the full job ID is used that it fails.  "qstat" to simply list the jobs
> works fine.
>
> I don't see any way to override the job IDs, although I would personally
> prefer it if I could make the job IDs not use the server name.  But we
> have third-party software that cannot be reconfigured to use just the
> numeric part, so the failover breaks this software, and it seems odd to
> me in any case that I can't specify the full job ID if in a failover
> situation.
>
> I didn't see anything in the archives or documentation that deals with
> this.  Any thoughts?
>
> Thanks,
> Drew.
>
>
> Drew Leske, Senior Systems Administrator    | dleske at uvic.ca
> Unix Services Team, University Systems      | 250-472-5055 (office)
> University of Victoria                      | 250-588-4311 (cell)
>    
Did you try specifying just the job number, without the host portion of
the ID?
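
If the third-party software can only hand over the full ID, one possible
workaround is a small wrapper that drops the host portion before qstat is
called. The following is only a sketch, not a tested fix: the function names
are made up for illustration, it assumes qstat is on the PATH, and it assumes
the software can be pointed at a wrapper instead of the real qstat. It relies
only on the behaviour reported above, namely that the bare number works
while the full ID does not.

```python
import subprocess


def strip_server_suffix(job_id):
    """Return the part of a job ID before the first dot.

    '85112.A.uvic.ca' -> '85112'. An ID without a dot is returned unchanged.
    """
    return job_id.split(".", 1)[0]


def rewrite_args(argv):
    """Strip the host portion from arguments that look like <digits>.<host>.

    Options such as '-f' and anything that does not start with a purely
    numeric sequence number are passed through untouched.
    """
    rewritten = []
    for arg in argv:
        head, dot, _rest = arg.partition(".")
        if dot and head.isdigit():
            rewritten.append(head)
        else:
            rewritten.append(arg)
    return rewritten


def run_qstat(argv):
    """Call qstat (assumed to be on PATH) with rewritten arguments.

    Returns qstat's exit status.
    """
    return subprocess.call(["qstat"] + rewrite_args(argv))
```

A wrapper script would call run_qstat(sys.argv[1:]) and exit with its return
value. Job IDs that do not begin with a plain number followed by a dot are
deliberately left alone, so this covers only the simple case shown above.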

Ken Nielson
Adaptive Computing

