[T1-admin] [torqueusers] bad connect from x.x.x.x

Amitoj G. Singh amitoj at cs.uh.edu
Tue Aug 3 12:26:27 MDT 2004


I have never been convinced that this is the "right" solution, but it
seems to fix the error. We are running Torque version 1.1.0p0 and Maui
version 3.2.5. On each worker node (slave) you need a "config" file in
the $PBS_INSTALL_DIR/mom_priv directory. Our config file looks as
follows:

$logevent 0x1ff
$clienthost master
$clienthost slave1
$clienthost slave2
$clienthost slave3
$clienthost slave4

where master is the headnode of the cluster and slave1 through slave4 are
the slave nodes. Make sure the above hostnames are listed in the
/etc/hosts file. Every slave node that will access this node should be
listed in the "config" file.

Again, if there is a "right" solution to the above problem, I would be
more than glad to know about it.

- Amitoj.

On Tue, 3 Aug 2004, luca dell'agnello wrote:

> To clarify our previous email, we are adding some details (hoping they are enough).
> In our computing center we have a farm of 648 dual-processor nodes, for a
> total of about 2300 logical CPUs (we use hyperthreading on Xeons). On
> this farm the installed Torque is 1.1.0p0-1, while Maui is at release 3.2.6.
> The setup is simple: we have some dedicated queues plus one overflow queue,
> and scheduling is performed only on a user group ID basis.
>
> And now the problems we have (at least the main ones):
>
> - pbs_mom dies "regularly" on many nodes;
> - Maui runs but schedules very slowly, if at all, and we have never had
> all the CPUs busy.
>
> From the log files we have no clear evidence of the problem: on the
> pbs_mom side we see messages such as "bad connect from x.x.x.x:1023"
> and "end of file from addr <pbs server address>:150001\nPremature end
> of message from addr <pbs server address>:150001\n".
> Since in the near future we shall buy more and more CPUs, we have to
> decide how to evolve our configuration, namely whether to create many
> independent farms or, if the problems persist, to switch to another
> batch system. So our main doubt is whether these problems are due to
> some misconfiguration in our setup, to scalability problems in Torque
> itself, or "simply" to a bug somewhere.
>
> Thanks for your help
>
> luca
>
> Alessandro Italiano wrote:
>
> >Hi,
> >
> >referring to this mail:
> >
> >Mon, 22 Mar 2004 12:55:52 -0700 (MST)
> >
> >
> >
> >>what causes a problem like this?
> >>
> >>03/18/2004 11:09:50;0001;   pbs_mom;Svr;pbs_mom;im_request, bad connect
> >>from 10.0.0.160:1023
> >>03/18/2004 11:09:51;0001;   pbs_mom;Svr;pbs_mom;im_request, bad connect
> >>from 10.0.0.160:1023
> >>03/18/2004 11:09:51;0001;   pbs_mom;Svr;pbs_mom;im_request, bad connect
> >>from 10.0.0.160:1023
> >>03/18/2004 11:09:52;0001;   pbs_mom;Svr;pbs_mom;im_request, bad connect
> >>from 10.0.0.160:1023
> >>03/18/2004 11:09:53;0001;   pbs_mom;Svr;pbs_mom;im_request, bad connect
> >>from 10.0.0.160:1023
> >>03/18/2004 11:09:53;0001;   pbs_mom;Svr;pbs_mom;im_request, bad connect
> >>from 10.0.0.160:1023
> >>
> >>
> >>Occasionally this happens to a node, and any job trying to use that node
> >>will get stuck in the queue trying to start.
> >>
> >>
> >>
> >
> >We are encountering the same problem.
> >
> >Have you fixed it?
> >
> >
> >Ale
> >
> >
> >_______________________________________________
> >T1-admin mailing list
> >T1-admin at iris.cnaf.infn.it
> >https://iris.cnaf.infn.it/mailman/listinfo/t1-admin
> >
> >
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://supercluster.org/mailman/listinfo/torqueusers
>


