[torqueusers] torque moms keep adding client to okclients list every minute repeatedly
Ken Nielson
knielson at adaptivecomputing.com
Mon Feb 8 12:19:55 MST 2010
Rahul,
This message is also only logged at log level 4. I need to look into
this, but if it is happening every couple of minutes, that seems
excessive. Again, though, I am not sure.
Ken
Al Taufer wrote:
> I believe that this is an issue that has been corrected since version 2.3.3. Torque would work correctly but there was way too much IS_HELLO and IS_CLUSTER_ADDRS communication activity. I am not sure which version corrected the problem, so I would consider upgrading to the latest 2.3.10 version to eliminate it.
>
> Al Taufer
> Adaptive Computing
>
> ----- "Rahul Nabar" <rpnabar at gmail.com> wrote:
>
>
>> On Mon, Feb 8, 2010 at 9:03 AM, Ken Nielson
>> <knielson at adaptivecomputing.com> wrote:
>>
>> Thanks Ken for your explanation!
>>
>>> Each time the MOM sends an IS_HELLO message to the server, the server will
>>> reply with an IS_CLUSTER_ADDRS message. This is where the "added to
>>> okclients" message comes from. An IS_HELLO is generated when the MOM
>>> starts. It is also generated if the MOM wants to re-establish a connection
>>> with pbs_server. I just did a quick check of the code and those are the
>>> two main things I see. There are probably one or two more reasons. In
>>> general this is not an error.
>>>
>> Ok. I was worried because it logs 2 lines for each of 300 nodes every
>> other minute. That makes the mom_logs huge.
>>
>>
>>> It is just MOM staying in sync with the cluster.
>>>
>> Sorry, I didn't understand the question. The moms seem to be working
>> and running jobs. How do I check whether or not they are in sync?
>>
>>
>>> How many nodes are in your cluster?
>>>
>> We have ~300 nodes.
>>
>>
>>> What version of TORQUE are you running?
>>>
>> Torque 2.3.3.
>>
>> It does this the first time:
>>
>> 02/08/2010 00:00:02;0002; pbs_mom;Svr;Log;Log opened
>> 02/08/2010 00:00:02;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server euadmin
>> 02/08/2010 00:00:02;0002;   pbs_mom;n/a;mom_server_update_stat;status update successfully sent to euadmin
>> 02/08/2010 00:00:02;0008;   pbs_mom;Job;do_rpp;got an inter-server request
>> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;stream 0 version 1
>> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;command 2, "CLUSTER_ADDRS", received
>> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.1 added to okclients
>> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.2 added to okclients
>> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.3 added to okclients
>> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.4 added to okclients
>> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.5 added to okclients
>> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.6 added to okclients
>> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.7 added to okclients
>> 02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.8 added to okclients
>>
>> And then repeated blocks of this every 2 minutes:
>>
>> 02/08/2010 00:09:47;0002;   pbs_mom;n/a;mom_server_update_stat;status update successfully sent to euadmin
>> 02/08/2010 00:10:32;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server euadmin
>> 02/08/2010 00:10:32;0002;   pbs_mom;n/a;mom_server_update_stat;status update successfully sent to euadmin
>> 02/08/2010 00:10:32;0008;   pbs_mom;Job;do_rpp;got an inter-server request
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;stream 0 version 1
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;command 2, "CLUSTER_ADDRS", received
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.1 added to okclients
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.2 added to okclients
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.3 added to okclients
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.4 added to okclients
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.5 added to okclients
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.6 added to okclients
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.7 added to okclients
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.8 added to okclients
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.9 added to okclients
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.10 added to okclients
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.11 added to okclients
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.12 added to okclients
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.13 added to okclients
>> 02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.14 added to okclients
>>
>> --
>> Rahul
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>
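For anyone wanting to check how often their own MOMs are really sending
hellos, here is a rough Python sketch that reads pbs_mom log lines in the
format pasted above and reports the gap between "sending hello to server"
entries and how many "added to okclients" lines follow each one. It is not
part of TORQUE; the function name summarize_mom_log and the SAMPLE list are
made up for illustration, and it assumes each log entry sits on a single line
(not wrapped by a mail client) with a leading MM/DD/YYYY HH:MM:SS timestamp.

```python
from collections import Counter
from datetime import datetime

# Timestamp layout assumed from the log excerpts in this thread.
TS_FORMAT = "%m/%d/%Y %H:%M:%S"


def summarize_mom_log(lines):
    """Return (hello_times, gaps_in_seconds, okclients_count_by_timestamp).

    Hypothetical helper: entries are grouped by their one-second timestamp,
    so two bursts landing in the same second would be counted together.
    """
    hello_times = []
    okclients_per_burst = Counter()
    for line in lines:
        stamp, sep, rest = line.partition(";")
        if not sep:
            continue  # not a semicolon-delimited log entry
        try:
            when = datetime.strptime(stamp.strip(), TS_FORMAT)
        except ValueError:
            continue  # skip lines without a parsable timestamp
        if "sending hello to server" in rest:
            hello_times.append(when)
        elif rest.rstrip().endswith("added to okclients"):
            okclients_per_burst[when] += 1
    # Seconds between consecutive hello messages.
    gaps = [(b - a).total_seconds()
            for a, b in zip(hello_times, hello_times[1:])]
    return hello_times, gaps, dict(okclients_per_burst)


# A few entries copied from the excerpts above (abbreviated sample).
SAMPLE = [
    "02/08/2010 00:00:02;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server euadmin",
    "02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.1 added to okclients",
    "02/08/2010 00:00:02;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.2 added to okclients",
    "02/08/2010 00:10:32;0002;   pbs_mom;n/a;mom_server_check_connection;sending hello to server euadmin",
    "02/08/2010 00:10:32;0001;   pbs_mom;Job;is_request;is_request: 10.0.0.1 added to okclients",
]

hello_times, gaps, counts = summarize_mom_log(SAMPLE)
print("hello messages:", len(hello_times))
print("gaps between hellos (s):", gaps)
print("okclients lines per burst:", sorted(counts.values()))
```

To run it against a real log, pass an open file instead of SAMPLE, for
example summarize_mom_log(open(path)) with path pointing at one of your
mom_logs files. For the two hello entries pasted above (00:00:02 and
00:10:32) the gap works out to 630 seconds.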