[torqueusers] Has anyone tested HA features in a VM environment

Stewart.Samuels at sanofi-aventis.com Stewart.Samuels at sanofi-aventis.com
Mon Dec 15 06:18:40 MST 2008


I have not yet had the opportunity to test the 2.3.5 release of TORQUE
with --ha.  However, I have tested the 2.3.4 release fairly extensively
using VMware-based hosts (2 masters and 1 compute node), configured
according to the TORQUE reference manual.  To date, I have NOT been able
to get HA working robustly enough to use on a production system.  When I
force the VM containing the primary master to fail, jobs that were
executing at the time of the failure sometimes complete under the
takeover master and sometimes just hang.  In addition, jobs that were
queued when the failover occurred simply stay queued until the failed
master is brought back up; they do not go into execution when the
secondary master becomes the primary.  Something gets lost here.  I can,
however, submit new jobs from the new master, and they do go into
execution.  So some things are definitely working, but others need
further work.
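
One way to make the failover behaviour described above reproducible is
to record each job's state before killing the primary master and again
after the secondary has taken over, then compare the two.  A minimal
Python sketch follows.  The function name, the snapshot dictionaries and
the job IDs in the usage note are hypothetical; the single-letter states
(Q queued, R running, C completed) are the ones qstat reports.  How the
snapshots are collected is left out.

```python
def classify_after_failover(before, after):
    """Compare two {job_id: state} snapshots taken around a failover.

    States are qstat-style letters: Q (queued), R (running), C (completed).
    Returns a dict mapping a category name to a sorted list of job IDs.
    """
    report = {
        "stuck_queued": [],   # queued before and still queued afterwards
        "started": [],        # queued before, running afterwards
        "still_running": [],  # running before and after (may be hung)
        "completed": [],      # reported as completed afterwards
        "missing": [],        # no longer listed in the second snapshot
        "other": [],          # any transition not covered above
    }
    for job_id in sorted(before):
        old = before[job_id]
        new = after.get(job_id)
        if new is None:
            report["missing"].append(job_id)
        elif new == "C":
            report["completed"].append(job_id)
        elif old == "Q" and new == "Q":
            report["stuck_queued"].append(job_id)
        elif old == "Q" and new == "R":
            report["started"].append(job_id)
        elif old == "R" and new == "R":
            report["still_running"].append(job_id)
        else:
            report["other"].append(job_id)
    return report
```

For example, with before = {"101": "R", "102": "Q"} and
after = {"101": "C", "102": "Q"}, job 102 lands in "stuck_queued", which
is the symptom described above.  A job in "still_running" is not
necessarily hung; that needs a second look after its expected run time.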

BTW, I'm running RHEL 4.6 in the VMs.

	Stewart 

-----Original Message-----
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Prakash
Velayutham
Sent: Friday, December 12, 2008 4:38 PM
To: torqueusers Users
Subject: [torqueusers] Has anyone tested HA features in a VM environment

Hello All,

Has anyone here tested Torque with "--ha" in a VM (VMware based)
environment?

I tried the following:

2 VM Torque nodes running OpenSUSE 10.3, Torque-2.3.5

PBS Mom systems (physical hosts, not VMs) running Torque-2.3.5.

In this case, everything seems to run OK until I submit a large batch of
jobs, at which point I start getting errors like:

pbs_iff: cannot read reply from pbs_server Cannot connect to specified
server host 'bmiclustersvc2-int'.
qsub: cannot connect to server bmiclustersvc2-int (errno=111) Connection
refused
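
For reference, errno=111 is ECONNREFUSED on Linux, meaning nothing
accepted the TCP connection on the server port at that moment.  To see
whether pbs_server itself stops accepting connections during a bulk
submission, a small probe like the one below could be run in a loop
alongside the submissions.  This is only a sketch: the function name is
hypothetical, and port 15001 is assumed to be the pbs_server port (the
usual default; check your own configuration).

```python
import errno
import socket


def probe(host, port=15001, timeout=2.0):
    """Try one TCP connect and describe the outcome.

    Returns 'open', 'refused', 'timeout', or 'error: <reason>'.
    The default port 15001 is an assumption, not taken from the thread.
    """
    try:
        # create_connection resolves the host and performs the connect.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Same condition qsub reports as errno=111 on Linux.
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Name resolution failures, unreachable networks, and so on.
        return "error: " + errno.errorcode.get(exc.errno, str(exc))
```

Calling probe("bmiclustersvc2-int") once a second while the jobs are
being submitted would show whether "refused" lines up with the qsub
failures or whether the server port stays open throughout.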

Anyone seen this before? Any ideas what could be going wrong?

Thanks,
Prakash

On Dec 12, 2008, at 4:12 PM, Josh Butikofer wrote:

> This website may be helpful for you:
>
> http://www.clusterresources.com/torquedocs21/4.3high-availability.shtml
>
> It explains how to set up high availability and will probably do 
> what you want.
>
> Josh Butikofer
> Cluster Resources, Inc.
> #############################
>
>
> Yang Wang wrote:
>> Dear friends,
>> Is it possible to run two pbs_server daemons for the same cluster 
>> for failover purposes? Has anyone done this? Is there a brief doc 
>> showing how to set up such a system?
>> Thanks and happy holidays!
>> Yang
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
