[torqueusers] Problem with HA and job IDs

Drew Leske dleske at uvic.ca
Wed Jun 23 01:59:06 MDT 2010


Hi all,

We run two Torque servers in a high-availability setup, and for the
most part it works well.  The failover is pretty slick.  But we have
found a problem for which I can't find a solution.

We have two servers, named A and B.  They both run "pbs_server --ha" and
whichever one is active opens up port 15001 for incoming requests.  If I
kill server A, B picks it up pretty quickly.  

Job IDs are all suffixed with A's FQDN, which is consistent with the
documentation, since A is the first listed server in all configurations.
Even if B is active and A is dead, new jobs will have IDs such as
498243.A.uvic.ca.

It turns out this is a problem for client nodes, even if they are
properly configured with /var/spool/torque/server_name containing "A,B".
Specifying a full job ID such as "498243.A.uvic.ca" when B is active
fails:

  C$ qstat 85112.A
  Cannot connect to specified server host 'A'
  qstat: cannot connect to server A (errno=111) Connection refused

If only "85112" is specified, then it works as expected.  It's only when
the full job ID is used that it fails.  "qstat" to simply list the jobs
works fine.
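
For scripts we control, the workaround amounts to dropping everything
after the first dot before calling qstat.  A minimal sketch of that
idea follows; the function name is made up for illustration and is not
part of Torque, and it assumes the numeric part never contains a dot:

```python
def numeric_job_id(job_id: str) -> str:
    """Return the part of a job ID before the first dot.

    Hypothetical helper, not a Torque API: "85112.A" becomes "85112",
    and an ID that is already short is returned unchanged.
    """
    # partition() splits at the first "." only, so a full FQDN suffix
    # such as ".A.uvic.ca" is dropped in one step.
    head, _sep, _server = job_id.partition(".")
    return head


if __name__ == "__main__":
    for full_id in ("85112.A", "498243.A.uvic.ca", "85112"):
        print(full_id, "->", numeric_job_id(full_id))
```

The short ID can then be passed to qstat in place of the full one.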

I don't see any way to override the job IDs, although I would personally
prefer job IDs that don't include the server name.  But we have
third-party software that cannot be reconfigured to use just the numeric
part, so failover breaks this software.  In any case, it seems odd to me
that I can't specify the full job ID in a failover situation.

I didn't see anything in the archives or documentation that deals with
this.  Any thoughts?

Thanks,
Drew.


Drew Leske, Senior Systems Administrator    | dleske at uvic.ca
Unix Services Team, University Systems      | 250-472-5055 (office)
University of Victoria                      | 250-588-4311 (cel)

