[torqueusers] bad connect from x.x.x.x
Bill Wichser
bill at Princeton.EDU
Wed Sep 22 10:56:44 MDT 2004
torque-1.1.0p0
maui-3.2.6
mpiexec-0.76
This has come up on the list before, with no solution offered.
The message on the afflicted node is:
pbs_mom;Svr;pbs_mom;im_request, bad connect from 172.16.0.33:1023 - unauthorized (okclients:172.16.100.1,172.16.0.37,127.0.0.1)
172.16.100.1 is the head node, listed on the clients as a $clienthost.
172.16.0.33 is the master node in the MPI code.
172.16.0.37 is this client node.
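
To make the mismatch explicit, here is a minimal sketch that pulls the sender address and the okclients list out of the message above and checks whether the sender is in the list. The regular expression and the function name parse_bad_connect are illustrative assumptions based only on this one log line, not on any documented pbs_mom log format.

```python
import re

# The message from the afflicted node, joined onto one line.
LOG_LINE = (
    "pbs_mom;Svr;pbs_mom;im_request, bad connect from "
    "172.16.0.33:1023 - unauthorized "
    "(okclients:172.16.100.1,172.16.0.37,127.0.0.1)"
)

# Assumed shape: "bad connect from IP:PORT ... (okclients:a,b,c)".
PATTERN = re.compile(
    r"bad connect from (?P<ip>\d+(?:\.\d+){3}):(?P<port>\d+)"
    r".*\(okclients:(?P<ok>[^)]*)\)"
)


def parse_bad_connect(line):
    """Return (sender_ip, port, okclients) or None if the line does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    ok = [addr.strip() for addr in m.group("ok").split(",") if addr.strip()]
    return m.group("ip"), int(m.group("port")), ok


sender, port, okclients = parse_bad_connect(LOG_LINE)
missing = sender not in okclients
print(f"sender={sender} port={port} missing_from_okclients={missing}")
```

For this message the sender is the MPI master node, and it is the one address absent from the okclients list held by the MOM on this node.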
According to my understanding, the head node builds my client list from
the server_priv/nodes file and ships this to the MOM on job start. I am
using mpiexec for this startup.
Sometimes a restart of the pbs_server as well as all the pbs_moms on the
clients fixes this problem. Other times it takes multiple restarts to
correct.
I have tried listing each and every node, by nodename, as a $clienthost
in the mom_priv/config file to no avail. Perhaps adding the IP address
might help. But there seems to be something wrong somewhere either in
the server or the mom, I'm not really sure which.
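
For reference, the per-node $clienthost entries described above can be generated rather than typed by hand. The following is only a sketch: it assumes the server_priv/nodes file has one node per line with the node name as the first whitespace-separated field, and the node names in the sample are hypothetical. As noted, listing the nodes this way did not cure the problem here.

```python
def clienthost_lines(nodes_text):
    """Turn the text of a nodes file into $clienthost lines for mom_priv/config.

    Assumption: one node per line, node name first, optional attributes
    (such as np=2) after it; blank lines and '#' comments are skipped.
    """
    lines = []
    for raw in nodes_text.splitlines():
        entry = raw.split("#", 1)[0].strip()
        if not entry:
            continue
        name = entry.split()[0]
        lines.append(f"$clienthost {name}")
    return lines


# Hypothetical sample; real node names come from server_priv/nodes.
SAMPLE_NODES = """\
node33 np=2
node37 np=2
"""

result = clienthost_lines(SAMPLE_NODES)
print("\n".join(result))
```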
The situation arises when a node is rebooted or the pbs_mom is
restarted, but even in these cases it doesn't happen very often.
Can anyone offer a suggestion? Is torque-1.1.0p1 a possible solution?
Are the mpiexec patches for input/output redirect really installed in
that release?
Thanks,
Bill