[torqueusers] torque moms keep adding client to okclients list every minute repeatedly

Al Taufer ataufer at adaptivecomputing.com
Mon Feb 8 11:29:52 MST 2010


I believe this issue has been corrected in a release after 2.3.3.  Torque 2.3.3 works correctly, but it generates far too much IS_HELLO and IS_CLUSTER_ADDRS communication activity.  I am not sure which version corrected the problem, so I would consider upgrading to the latest version, 2.3.10, to eliminate it.
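
To judge whether an upgrade actually reduces the chatter, it may help to measure how often a MOM sends a hello before and after. Below is a minimal Python sketch, assuming the mom log keeps one entry per line with a leading `MM/DD/YYYY HH:MM:SS` timestamp as in the excerpts quoted in this thread. The function names are hypothetical and are not part of TORQUE.

```python
from datetime import datetime

# Text taken from the pbs_mom log excerpt quoted in this thread.
HELLO_MARKER = "sending hello to server"
# Assumed timestamp layout, inferred from the quoted log lines.
TIME_FORMAT = "%m/%d/%Y %H:%M:%S"


def hello_times(lines):
    """Return the timestamps of entries that report a hello being sent."""
    times = []
    for line in lines:
        if HELLO_MARKER not in line:
            continue
        stamp = line.split(";", 1)[0].strip()
        try:
            times.append(datetime.strptime(stamp, TIME_FORMAT))
        except ValueError:
            # Skip lines whose first field is not a timestamp.
            continue
    return times


def hello_intervals(lines):
    """Return the gaps, in seconds, between consecutive hello entries."""
    times = hello_times(lines)
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


# Two hello entries copied from the quoted log (unwrapped to one line each).
SAMPLE = [
    "02/08/2010 00:00:02;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server euadmin",
    "02/08/2010 00:09:47;0002;   pbs_mom;n/a;mom_server_update_stat;status update successfully sent to euadmin",
    "02/08/2010 00:10:32;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server euadmin",
]

if __name__ == "__main__":
    print(hello_intervals(SAMPLE))
```

Running the same function over a full log file (for example `hello_intervals(open(path))`, with `path` pointing at one day's mom log) before and after the upgrade should show whether the hellos have become less frequent.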

Al Taufer
Adaptive Computing

----- "Rahul Nabar" <rpnabar at gmail.com> wrote:

> On Mon, Feb 8, 2010 at 9:03 AM, Ken Nielson
> <knielson at adaptivecomputing.com> wrote:
> 
> Thanks Ken for your explanation!
> >
> > Each time the MOM sends an IS_HELLO message to the server, the server
> > replies with an IS_CLUSTER_ADDRS message. This is where the "added to
> > okclients" message comes from. An IS_HELLO is generated when the MOM
> > starts. It is also generated if the MOM wants to re-establish a
> > connection with pbs_server.
> >
> > I just did a quick check of the code and those are the two main things
> > I see. There are probably one or two more reasons. In general this is
> > not an error.
> 
> Ok. I was worried because it logs 2 lines for each of 300 nodes every
> other minute. That makes the mom_logs huge.
> 
> > It is just MOM staying in sync with the cluster.
> 
> Sorry, I didn't understand the question. The moms seem to be working
> and running jobs. How do I check whether or not they are in sync?
> 
> >
> > How many nodes are in your cluster?
> 
> We have ~300 nodes.
> 
> >What version of TORQUE are you running?
> 
> Torque 2.3.3.
> 
> It does this the first time:
> 
> 02/08/2010 00:00:02;0002;   pbs_mom;Svr;Log;Log opened
> 02/08/2010 00:00:02;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server euadmin
> 02/08/2010 00:00:02;0002;   pbs_mom;n/a;mom_server_update_stat;status update successfully sent to euadmin
> 02/08/2010 00:00:02;0008;   pbs_mom;Job;do_rpp;got an inter-server request
> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;stream 0 version 1
> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;command 2, "CLUSTER_ADDRS", received
> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.1 added to okclients
> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.2 added to okclients
> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.3 added to okclients
> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.4 added to okclients
> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.5 added to okclients
> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.6 added to okclients
> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.7 added to okclients
> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.8 added to okclients
> 
> And then repeated blocks of this every 2 minutes:
> 
> 02/08/2010 00:09:47;0002;   pbs_mom;n/a;mom_server_update_stat;status update successfully sent to euadmin
> 02/08/2010 00:10:32;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server euadmin
> 02/08/2010 00:10:32;0002;   pbs_mom;n/a;mom_server_update_stat;status update successfully sent to euadmin
> 02/08/2010 00:10:32;0008;   pbs_mom;Job;do_rpp;got an inter-server request
> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;stream 0 version 1
> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;command 2, "CLUSTER_ADDRS", received
> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.1 added to okclients
> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.2 added to okclients
> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.3 added to okclients
> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.4 added to okclients
> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.5 added to okclients
> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.6 added to okclients
> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.7 added to okclients
> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.8 added to okclients
> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.9 added to okclients
> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.10 added to okclients
> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.11 added to okclients
> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.12 added to okclients
> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.13 added to okclients
> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.14 added to okclients
> 
> -- 
> Rahul
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

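Judging from the log excerpts quoted above, each mom log entry appears to consist of six semicolon-separated fields ending in a free-text message. The Python sketch below relies on that layout to estimate how much of a log is taken up by "added to okclients" entries and how many arrive per timestamp. The field names (`timestamp`, `code`, `daemon`, and so on) are labels assumed here for illustration, not official TORQUE terminology, and the helper functions are hypothetical.

```python
from collections import Counter

# Assumed labels for the six semicolon-separated fields seen in the excerpts.
FIELDS = ("timestamp", "code", "daemon", "obj_class", "name", "message")
OKCLIENTS_SUFFIX = "added to okclients"


def split_entry(line):
    """Split one log line into a dict, or return None if it does not fit."""
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, (p.strip() for p in parts)))


def okclients_summary(lines):
    """Return (okclients entries, total entries, okclients count per timestamp)."""
    total = 0
    okclients = 0
    per_stamp = Counter()
    for line in lines:
        entry = split_entry(line)
        if entry is None:
            continue
        total += 1
        if entry["message"].endswith(OKCLIENTS_SUFFIX):
            okclients += 1
            per_stamp[entry["timestamp"]] += 1
    return okclients, total, per_stamp


# Entries copied from the quoted log (unwrapped to one line each).
SAMPLE = [
    "02/08/2010 00:00:02;0002;   pbs_mom;Svr;Log;Log opened",
    '02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;command 2, "CLUSTER_ADDRS", received',
    "02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.1 added to okclients",
    "02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.2 added to okclients",
]

if __name__ == "__main__":
    ok, total, per_stamp = okclients_summary(SAMPLE)
    print(f"{ok} of {total} entries are okclients additions")
    for stamp, count in sorted(per_stamp.items()):
        print(stamp, count)
```

On a real log, a per-timestamp count close to the number of nodes in the cluster would suggest that each CLUSTER_ADDRS reply re-adds every node, which matches the growth in the mom_logs described above.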
