[torqueusers] bad connect from x.x.x.x

Glen Beane beaneg at umcs.maine.edu
Wed Sep 22 12:27:36 MDT 2004


This always happens on my Linux cluster when a pbs_mom is restarted
without being shut down cleanly.  For me the only fix is to restart
pbs_server, which always fixes our problem.

I haven't seen this issue on our OS X cluster (although I do see other
problems on that platform :) ).
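
When comparing nodes, it can help to pull the sender address and the
okclients list out of the message quoted below and check them
mechanically.  What follows is only a minimal sketch: the regular
expression is an assumption based on the single log line in this thread
(not on the TORQUE source), and the function name parse_bad_connect is
made up for illustration.

```python
import re

# Assumed layout, inferred from the one log line quoted in this thread.
BAD_CONNECT = re.compile(
    r"bad connect from\s*(?P<ip>\d+(?:\.\d+){3}):(?P<port>\d+)"
    r".*?\(okclients:(?P<ok>[^)]*)\)"
)


def parse_bad_connect(message):
    """Return (sender_ip, port, okclients), or None if the text does not match."""
    # Rejoin a message that the mail client wrapped in the middle of the address.
    joined = "".join(part.strip() for part in message.splitlines())
    match = BAD_CONNECT.search(joined)
    if match is None:
        return None
    okclients = [c.strip() for c in match.group("ok").split(",") if c.strip()]
    return match.group("ip"), int(match.group("port")), okclients


if __name__ == "__main__":
    msg = (
        "pbs_mom;Svr;pbs_mom;im_request, bad connect from 172\n"
        ".16.0.33:1023 - unauthorized "
        "(okclients:172.16.100.1,172.16.0.37,127.0.0.1)"
    )
    ip, port, okclients = parse_bad_connect(msg)
    # True when the sender is absent from the list the MOM reported.
    print(ip, port, ip not in okclients)
```

For the quoted message the sender (the MPI master node) is not in the
okclients list the MOM printed, which fits the "unauthorized" wording.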

On Wed, 2004-09-22 at 12:56, Bill Wichser wrote:
> torque-1.1.0p0
> maui-3.2.6
> mpiexec-0.76
> 
> This has come up before with no solutions.
> 
> The message on the afflicted node is:
> 
> pbs_mom;Svr;pbs_mom;im_request, bad connect from 172.16.0.33:1023 - unauthorized (okclients:172.16.100.1,172.16.0.37,127.0.0.1)
> 
> 100.1 is the head node, listed in the clients as a $clienthost.
> 0.33 is the master node in the MPI code.
> 0.37 is this client node.
> 
> According to my understanding, the head node builds my client list from 
> the server_priv/nodes file and ships this to the MOM on job start.  I am 
> using mpiexec for this startup.
> 
> Sometimes a restart of the pbs_server as well as all the pbs_moms on the 
> clients fixes this problem.  Other times it takes multiple restarts to 
> correct.
> 
> I have tried listing each and every node, by nodename, as a $clienthost 
> in the mom_priv/config file to no avail.  Perhaps adding the IP address 
> might help.  But there seems to be something wrong somewhere either in 
> the server or the mom, I'm not really sure which.
> 
> The situation arises when a node is rebooted or the pbs_mom gets 
> restarted, but it doesn't happen very often even in these cases.
> 
> Can anyone offer a suggestion?  Is torque-1.1.0p1 a possible solution? 
> Are the mpiexec patches for input/output redirect really installed in 
> that release?
> 
> Thanks,
> 
> Bill
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://supercluster.org/mailman/listinfo/torqueusers
