[torqueusers] Problem with HA and job IDs
dleske at uvic.ca
Wed Jun 23 01:59:06 MDT 2010
We are running a high-availability setup with two Torque servers, and
for the most part it works well; the failover is pretty slick. But we
have found a problem for which I can't find a solution.
We have two servers, named A and B. They both run "pbs_server --ha" and
whichever one is active opens up port 15001 for incoming requests. If I
kill server A, B picks it up pretty quickly.
Job IDs are all suffixed with A's FQDN, which is consistent with the
documentation, since A is the first listed server in all configurations.
Even if B is active and A is dead, new jobs will have IDs such as
It turns out this is a problem for client nodes, even if they are
properly configured with /var/spool/torque/server_name containing "A,B".
Specifying a full job ID such as "49824.A.uvic.ca" while B is active fails:
C$ qstat 85112.A
Cannot connect to specified server host 'A'
qstat: cannot connect to server A (errno=111) Connection refused
If only "85112" is specified, then it works as expected. It's only when
the full job ID is used that it fails. "qstat" to simply list the jobs
also works.
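
A minimal sketch of a workaround based on that behavior, assuming a POSIX
shell: strip everything from the first dot onward and query by the numeric
part only, so the client falls back to the server list in server_name. The
names strip_job_server and qstat_ha are hypothetical and not part of Torque,
and this only helps callers that can be pointed at a wrapper.

```shell
# Hypothetical helper: drop the server suffix from a Torque job ID.
# "85112.A.uvic.ca" becomes "85112"; a bare "85112" passes through unchanged.
strip_job_server() {
    printf '%s\n' "${1%%.*}"
}

# Hypothetical wrapper: call qstat with the numeric part only, so the
# host name embedded in the job ID is never used to pick a server.
qstat_ha() {
    qstat "$(strip_job_server "$1")"
}
```

With these defined, "qstat_ha 85112.A" runs "qstat 85112".
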
I don't see any way to override the job IDs, although I would personally
prefer job IDs that do not include the server name. We have third-party
software that cannot be reconfigured to use just the numeric part, so
failover breaks this software. In any case, it seems odd to me that I
can't specify the full job ID during a failover.
I didn't see anything in the archives or documentation that deals with
this. Any thoughts?
Drew Leske, Senior Systems Administrator | dleske at uvic.ca
Unix Services Team, University Systems | 250-472-5055 (office)
University of Victoria | 250-588-4311 (cel)