[torqueusers] torque moms keep adding client to okclients list every minute repeatedly

Ken Nielson knielson at adaptivecomputing.com
Mon Feb 8 12:19:55 MST 2010


Rahul,

This message is also logged only at log level 4. I need to go and look
at this, but if it is happening every couple of minutes, that seems
excessive. Again, though, I am not sure.
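
To put a number on "every couple of minutes", something like the following rough sketch could be run against a MOM log. It assumes each record sits on a single line in the semicolon-separated layout shown in the excerpt quoted below (timestamp first, message text in the sixth field). The function names are made up for illustration and are not part of TORQUE.

```python
from datetime import datetime


def summarize_hello_blocks(lines):
    """Scan MOM log lines and return (hello_times, okclients_counts).

    hello_times      -- datetime of each "sending hello to server" record
    okclients_counts -- number of "added to okclients" records that
                        followed each hello
    Assumes one record per line, fields separated by ';'.
    """
    hello_times = []
    okclients_counts = []
    for line in lines:
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) < 6:
            continue  # not a complete log record, skip it
        stamp = fields[0].strip()
        message = fields[5].strip()
        if "sending hello to server" in message:
            hello_times.append(datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S"))
            okclients_counts.append(0)
        elif message.endswith("added to okclients") and okclients_counts:
            # attribute the line to the most recent hello
            okclients_counts[-1] += 1
    return hello_times, okclients_counts


def hello_intervals_seconds(hello_times):
    """Seconds elapsed between consecutive hello records."""
    return [(b - a).total_seconds()
            for a, b in zip(hello_times, hello_times[1:])]
```

Passing an open file handle for one of the mom_logs files to summarize_hello_blocks() and then the returned times to hello_intervals_seconds() would show how far apart the hellos really are and how many okclients lines each one triggers.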

Ken

Al Taufer wrote:
> I believe that this is an issue that has been corrected since version 2.3.3.  Torque would work correctly but there was way too much IS_HELLO and IS_CLUSTER_ADDRS communication activity.  I am not sure which version corrected the problem, so I would consider upgrading to the latest 2.3.10 version to eliminate it.
>
> Al Taufer
> Adaptive Computing
>
> ----- "Rahul Nabar" <rpnabar at gmail.com> wrote:
>
>   
>> On Mon, Feb 8, 2010 at 9:03 AM, Ken Nielson
>> <knielson at adaptivecomputing.com> wrote:
>>
>> Thanks Ken for your explanation!
>>
>>> Each time the MOM sends an IS_HELLO message to the server, the server
>>> replies with an IS_CLUSTER_ADDRS message. This is where the "added to
>>> okclients" message comes from. An IS_HELLO is generated when the MOM
>>> starts. It is also generated if the MOM wants to re-establish a
>>> connection with pbs_server.
>>> I just did a quick check of the code and those are the two main things I
>>> see. There are probably one or two more reasons. In general this is not an
>>> error.
>>
>> Ok. I was worried because it logs 2 lines for each of 300 nodes every
>> other minute. That makes the mom_logs huge.
>>
>>> It is just MOM staying in sync with the cluster.
>>
>> Sorry, I didn't understand the question. The moms seem to be working
>> and running jobs. How do I check whether or not they are in sync?
>>
>>> How many nodes are in your cluster?
>>
>> We have ~300 nodes.
>>
>>> What version of TORQUE are you running?
>>
>> Torque 2.3.3.
>>
>> It does this the first time:
>>
>> 02/08/2010 00:00:02;0002;   pbs_mom;Svr;Log;Log opened
>> 02/08/2010 00:00:02;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server euadmin
>> 02/08/2010 00:00:02;0002;   pbs_mom;n/a;mom_server_update_stat;status update successfully sent to euadmin
>> 02/08/2010 00:00:02;0008;   pbs_mom;Job;do_rpp;got an inter-server request
>> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;stream 0 version 1
>> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;command 2, "CLUSTER_ADDRS", received
>> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.1 added to okclients
>> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.2 added to okclients
>> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.3 added to okclients
>> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.4 added to okclients
>> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.5 added to okclients
>> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.6 added to okclients
>> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.7 added to okclients
>> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.8 added to okclients
>>
>> And then repeated blocks of this every 2 minutes:
>>
>> 02/08/2010 00:09:47;0002;   pbs_mom;n/a;mom_server_update_stat;status update successfully sent to euadmin
>> 02/08/2010 00:10:32;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server euadmin
>> 02/08/2010 00:10:32;0002;   pbs_mom;n/a;mom_server_update_stat;status update successfully sent to euadmin
>> 02/08/2010 00:10:32;0008;   pbs_mom;Job;do_rpp;got an inter-server request
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;stream 0 version 1
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;command 2, "CLUSTER_ADDRS", received
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.1 added to okclients
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.2 added to okclients
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.3 added to okclients
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.4 added to okclients
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.5 added to okclients
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.6 added to okclients
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.7 added to okclients
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.8 added to okclients
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.9 added to okclients
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.10 added to okclients
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.11 added to okclients
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.12 added to okclients
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.13 added to okclients
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.14 added to okclients
>>
>> -- 
>> Rahul
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>     