[torqueusers] bad connect from x.x.x.x
Glen Beane
beaneg at umcs.maine.edu
Wed Sep 22 12:27:36 MDT 2004
This always happens on my Linux cluster when a pbs_mom is restarted
without being shut down cleanly. The only fix I have found is to
restart pbs_server, which always clears the problem for us.
I haven't seen this issue on our OS X cluster (although I do see other
problems on that platform :)
On Wed, 2004-09-22 at 12:56, Bill Wichser wrote:
> torque-1.1.0p0
> maui-3.2.6
> mpiexec-0.76
>
> This has come up before with no solutions.
>
> The message on the afflicted node is:
>
> pbs_mom;Svr;pbs_mom;im_request, bad connect from 172.16.0.33:1023 - unauthorized (okclients:172.16.100.1,172.16.0.37,127.0.0.1)
>
> 100.1 is the head node, listed in the clients as a $clienthost.
> 0.33 is the master node in the MPI code.
> 0.37 is this client node.
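
For anyone scanning MOM logs for this, here is a minimal Python sketch
that pulls the rejected address and the okclients list out of a message
in the format quoted above and reports whether the sender is on the
list. The regular expression is inferred from this one sample line only,
and the name parse_bad_connect is made up for illustration; other TORQUE
versions may format the message differently.

```python
import re

# Pattern inferred from the single sample message quoted above (an assumption,
# not taken from the TORQUE source).
BAD_CONNECT = re.compile(
    r"bad connect from\s+(?P<ip>\d+\.\d+\.\d+\.\d+):(?P<port>\d+)"
    r".*?\(okclients:(?P<ok>[^)]*)\)"
)


def parse_bad_connect(line):
    """Return (source_ip, port, okclients), or None if the line does not match."""
    m = BAD_CONNECT.search(line)
    if m is None:
        return None
    ok = [addr.strip() for addr in m.group("ok").split(",") if addr.strip()]
    return m.group("ip"), int(m.group("port")), ok


# The message from the afflicted node, joined onto one line.
line = ("pbs_mom;Svr;pbs_mom;im_request, bad connect from 172.16.0.33:1023"
        " - unauthorized (okclients:172.16.100.1,172.16.0.37,127.0.0.1)")

source_ip, port, okclients = parse_bad_connect(line)
missing = source_ip not in okclients
print(source_ip, "missing from okclients" if missing else "present in okclients")
```

For the sample line the sender, 172.16.0.33, is not in okclients, which
matches the "unauthorized" rejection: this MOM's list holds only the
head node, the node itself, and localhost.
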
>
> According to my understanding, the head node builds my client list from
> the server_priv/nodes file and ships this to the MOM on job start. I am
> using mpiexec for this startup.
>
> Sometimes a restart of the pbs_server as well as all the pbs_moms on the
> clients fixes this problem. Other times it takes multiple restarts to
> correct.
>
> I have tried listing each and every node, by nodename, as a $clienthost
> in the mom_priv/config file to no avail. Perhaps adding the IP address
> might help. But there seems to be something wrong somewhere either in
> the server or the mom, I'm not really sure which.
>
> The situation arises when a node is rebooted or the pbs_mom gets
> restarted but doesn't happen very often even in these cases.
>
> Can anyone offer a suggestion? Is torque-1.1.0p1 a possible solution?
> Are the mpiexec patches for input/output redirect really installed in
> that release?
>
> Thanks,
>
> Bill