[T1-admin] [torqueusers] bad connect from x.x.x.x

luca dell'agnello luca.dellagnello at cnaf.infn.it
Tue Aug 3 06:25:23 MDT 2004


To clarify our previous email, here are some more details (we hope they are enough).
In our computing center we have a farm of 648 dual-processor nodes, for a 
total of about 2300 logical CPUs (we use hyperthreading on the Xeons). On 
this farm the installed torque is 1.1.0p0-1, while maui is at release 3.2.6.
The setup is simple: we have some dedicated queues plus one overflow queue, 
and scheduling is based only on the user's group ID.

And now the problems we have (at least the main ones):

- pbs_mom dies "regularly" on many nodes;
- maui runs but does not schedule jobs (or does so only very slowly), and we 
have never had all the CPUs busy.

The log files give no clear evidence of the cause. On the pbs_mom side we 
see messages such as "bad connect from x.x.x.x:1023" and "end of file from 
addr <pbs server address>:150001\nPremature end of message from addr 
<pbs server address>:150001\n".
Since we will be buying many more CPUs in the near future, we have to 
decide how to evolve our configuration: either split it into several 
independent farms or, if the problems persist, switch to another batch system.
So our main question is whether these problems are due to a misconfiguration 
in our setup, to a scalability problem in torque itself, or "simply" to a 
bug somewhere.
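
To see which peers trigger the "bad connect" message most often, the pbs_mom 
log lines can be tallied per source address. What follows is only a minimal 
sketch, not part of torque: the function name count_bad_connects is 
hypothetical, and the pattern assumes each log entry sits on a single line in 
the form "...;im_request, bad connect from <ip>:<port>", as in the excerpt 
quoted further down (where the mail client has wrapped the lines).

```python
import re
from collections import Counter

# Pattern assumed from the quoted log excerpt, e.g.
# 03/18/2004 11:09:50;0001;   pbs_mom;Svr;pbs_mom;im_request, bad connect from 10.0.0.160:1023
BAD_CONNECT = re.compile(r"bad connect\s+from\s+(\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def count_bad_connects(lines):
    """Return a Counter mapping source IP address to its number of 'bad connect' lines."""
    counts = Counter()
    for line in lines:
        match = BAD_CONNECT.search(line)
        if match:
            # group(1) is the source address; the port (group(2)) is ignored here
            counts[match.group(1)] += 1
    return counts
```

Calling count_bad_connects(open(path)) on a pbs_mom log file (path being 
wherever the mom logs live on the node) and printing the result of 
most_common() lists the offending addresses, most frequent first.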

Thanks for your help

luca

Alessandro Italiano wrote:

>Hi,
>
>referring to this mail:
>
>Mon, 22 Mar 2004 12:55:52 -0700 (MST)
>
>  
>
>>what causes a problem like this?
>>
>>03/18/2004 11:09:50;0001;   pbs_mom;Svr;pbs_mom;im_request, bad connect
>>from 10.0.0.160:1023
>>03/18/2004 11:09:51;0001;   pbs_mom;Svr;pbs_mom;im_request, bad connect
>>from 10.0.0.160:1023
>>03/18/2004 11:09:51;0001;   pbs_mom;Svr;pbs_mom;im_request, bad connect
>>from 10.0.0.160:1023
>>03/18/2004 11:09:52;0001;   pbs_mom;Svr;pbs_mom;im_request, bad connect
>>from 10.0.0.160:1023
>>03/18/2004 11:09:53;0001;   pbs_mom;Svr;pbs_mom;im_request, bad connect
>>from 10.0.0.160:1023
>>03/18/2004 11:09:53;0001;   pbs_mom;Svr;pbs_mom;im_request, bad connect
>>from 10.0.0.160:1023
>>
>>
>>Occasionally this happens to a node, and any job trying to use that node
>>will get stuck in the queue trying to start.
>>
>>    
>>
>
>We are encountering the same problem.
>
>Have you fixed it?
>
>
>Ale
>
>
