[torqueusers] Stale connection error in log file of torque server
k.mohammad1 at physics.ox.ac.uk
Thu Oct 27 05:20:33 MDT 2011
We are using Torque and Maui for a gLite-based grid cluster. Torque and Maui are installed on a separate virtual machine with 4 GB RAM and two CPUs, with almost no grid software on it. We are running torque-2.3.13-1.el5 and maui-3.2.6p21-snap.1234905291.5.el5. We are seeing stability issues with Torque and Maui for no apparent reason. Maui keeps hanging intermittently and then recovers without any intervention. The errors we see are:
ERROR: lost connection to server
ERROR: cannot request service (status)
It seems that Maui cannot contact the Torque server because the server is busy. I can see entries like these in the server log file:
Oct 27 07:51:43 t2torque02 pbs_server: LOG_ERROR::wait_request, connection 25 to host 0 has timed out after 900 seconds - closing stale connection
Oct 27 08:35:13 t2torque02 pbs_server: LOG_ERROR::wait_request, connection 14 to host 2734753079 has timed out after 900 seconds - closing stale connection
Here 2734753079 is the IP address of the Torque server itself. The load on the Torque server is not high at all, and it has plenty of free RAM. Any suggestions, please?
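
For anyone reading these log lines, here is a minimal sketch of how the numeric host field can be turned into a dotted-quad address. It assumes the number is an IPv4 address stored as a 32-bit integer in network (big-endian) byte order; that is an assumption on my part, not something taken from the Torque source. The helper names host_to_dotted and parse_stale are made up for illustration.

```python
import ipaddress
import re

# Matches the "closing stale connection" lines quoted above.
# The pattern is based only on those two sample lines.
STALE_RE = re.compile(
    r"connection (\d+) to host (\d+) has timed out after (\d+) seconds"
)


def host_to_dotted(host: int) -> str:
    """Convert a 32-bit integer to dotted-quad form.

    Assumes big-endian (network) byte order, which is how
    ipaddress.IPv4Address interprets an integer argument.
    """
    return str(ipaddress.IPv4Address(host))


def parse_stale(line: str):
    """Return (connection, dotted_host, timeout_seconds) or None."""
    match = STALE_RE.search(line)
    if match is None:
        return None
    conn, host, seconds = (int(g) for g in match.groups())
    return conn, host_to_dotted(host), seconds


if __name__ == "__main__":
    sample = (
        "Oct 27 08:35:13 t2torque02 pbs_server: LOG_ERROR::wait_request, "
        "connection 14 to host 2734753079 has timed out after 900 seconds "
        "- closing stale connection"
    )
    print(parse_stale(sample))
```

Under that byte-order assumption, 2734753079 decodes to 163.1.5.55 and the "host 0" in the first log line decodes to 0.0.0.0.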