[torqueusers] Torque losing LDAP connection
Corey Hirschman
corey at rentec.com
Thu Apr 21 15:52:44 MDT 2005
I was wondering if anyone else is using Torque in an LDAP environment and has dealt with this issue. Every few days we get error messages saying the pbs_server process could not connect to the LDAP server. Here is an example from today:
Apr 21 16:39:23 monsterrq pbs_server: nss_ldap: reconnecting to LDAP server...
Apr 21 16:39:23 monsterrq pbs_server: nss_ldap: reconnecting to LDAP server...
Apr 21 16:39:23 monsterrq pbs_server: nss_ldap: reconnecting to LDAP server (sleeping 4 seconds)...
Apr 21 16:39:27 monsterrq pbs_server: nss_ldap: reconnecting to LDAP server (sleeping 8 seconds)...
Apr 21 16:39:35 monsterrq pbs_server: nss_ldap: reconnecting to LDAP server (sleeping 16 seconds)...
Apr 21 16:39:51 monsterrq pbs_server: nss_ldap: reconnecting to LDAP server (sleeping 32 seconds)...
Apr 21 16:40:23 monsterrq pbs_server: nss_ldap: could not hard reconnect to LDAP server - Can't contact LDAP server
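The timestamps above are consistent with a doubling backoff: sleeps of 4, 8, 16 and 32 seconds add up to the 60 seconds between the first message and the final failure. As a rough sketch (this only models what the log shows; it is not taken from the nss_ldap source, and the function name and parameters are made up for illustration):

```python
from datetime import datetime, timedelta

def backoff_schedule(start, first_sleep=4, doublings=4):
    """Model a doubling backoff: return the (timestamp, sleep) pairs and the give-up time.

    first_sleep and doublings are assumptions inferred from the log excerpt,
    not values read from any nss_ldap configuration.
    """
    schedule = []
    now = start
    sleep = first_sleep
    for _ in range(doublings):
        schedule.append((now, sleep))
        now += timedelta(seconds=sleep)
        sleep *= 2
    return schedule, now

# Start time taken from the first log line in the excerpt.
start = datetime(2005, 4, 21, 16, 39, 23)
schedule, gave_up = backoff_schedule(start)
for ts, sleep in schedule:
    print(ts.strftime("%H:%M:%S"), "sleeping", sleep, "seconds")
print(gave_up.strftime("%H:%M:%S"), "gave up")
```

The printed times line up with the log entries at 16:39:23, 16:39:27, 16:39:35, 16:39:51 and 16:40:23.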
During this time no one can submit jobs and the Torque server is basically dead in the water. Everything else on the machine works normally, and I can still make LDAP queries that return fine. The servers themselves are also up and experience no problems; the problem appears to be isolated to the pbs_server process. After about 15 minutes everything starts working normally again. Sometimes this 15-minute wait is unacceptable and I have to restart the server, after which everything works fine again.
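For anyone who wants to compare, the kind of check I mean by "LDAP queries return fine" can be approximated from a separate process with a timed account lookup. This is only a sketch: it assumes account lookups on the machine are routed through nss_ldap by the system's name service configuration, and the helper name timed_lookup is made up for illustration.

```python
import pwd
import time

def timed_lookup(username, lookup=pwd.getpwnam):
    """Look up one account and return (found, elapsed_seconds).

    pwd.getpwnam goes through the system's name service switch, so on a
    host configured for nss_ldap the lookup is assumed to hit LDAP.
    The lookup argument exists only so the function can be exercised
    without a real directory.
    """
    start = time.monotonic()
    try:
        lookup(username)
        found = True
    except KeyError:
        # pwd.getpwnam raises KeyError when the account does not exist.
        found = False
    return found, time.monotonic() - start

if __name__ == "__main__":
    # "root" is just an example; substitute an account that lives in LDAP.
    found, elapsed = timed_lookup("root")
    print("found" if found else "not found", "in %.3f seconds" % elapsed)
```

Because this runs in its own fresh process, a fast successful lookup here says nothing about the state of the pbs_server process itself, which matches what we see.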
Is anyone else experiencing problems like this, or does anyone have a clue as to why the pbs_server process is losing communication with the LDAP server?
Thank you,
Corey Hirschman
Renaissance Technologies