[torqueusers] momctl errors
Kashif Mohammad
k.mohammad1 at physics.ox.ac.uk
Tue Dec 18 05:22:39 MST 2012
Hi,
I am troubleshooting a problem on our cluster where jobs on compute nodes sometimes lose contact with the Torque server. When this happens, the jobs keep running on the compute node while the Torque server thinks that they have finished.
There are a lot of these errors in the server log file:
PBS_Server: LOG_ERROR::Cannot assign requested address (99) in send_job, send_job failed to a3010517 port 15002
If I check
momctl -h node_name -d 2
it gives the expected output, but if I check
momctl -p 15002 -h node_name -d 2
it fails with this error:
ERROR: query[0] 'diag2' failed on node_name (errno=0-Success: 5-Input/output error)
I can see on the compute node that it is listening on port 15002, but connections coming in to this port stay in the TIME_WAIT state:
netstat -an | grep 15002
tcp 0 0 0.0.0.0:15002 0.0.0.0:* LISTEN
tcp 0 0 163.1.5.98:15002 163.1.5.44:861 TIME_WAIT
tcp 0 0 163.1.5.98:15002 163.1.5.44:607 TIME_WAIT
tcp 0 0 163.1.5.98:15002 163.1.5.44:831 TIME_WAIT
tcp 0 0 163.1.5.98:15002 163.1.5.44:999 TIME_WAIT
tcp 0 0 163.1.5.98:15002 163.1.5.44:993 TIME_WAIT
tcp 0 0 163.1.5.98:15002 163.1.5.44:985 TIME_WAIT
tcp 0 0 163.1.5.98:15002 163.1.5.44:685 TIME_WAIT
tcp 0 0 163.1.5.98:15002 163.1.5.44:897 TIME_WAIT
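To tally these states rather than read the list by eye, a minimal Python sketch along the following lines could be used. The function name count_states and the shortened sample text are illustrative assumptions, and the parsing assumes the six-column Linux "netstat -an" layout shown above.

```python
from collections import Counter


def count_states(netstat_output, port):
    """Count TCP states for sockets whose local address ends in :port."""
    suffix = ":" + str(port)
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: proto, recv-q, send-q, local, remote, state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[3].endswith(suffix):
            states[fields[5]] += 1
    return states


# Shortened sample taken from the netstat output above (illustrative only).
SAMPLE = """\
tcp 0 0 0.0.0.0:15002 0.0.0.0:* LISTEN
tcp 0 0 163.1.5.98:15002 163.1.5.44:861 TIME_WAIT
tcp 0 0 163.1.5.98:15002 163.1.5.44:607 TIME_WAIT
"""

if __name__ == "__main__":
    for state, count in count_states(SAMPLE, 15002).most_common():
        print(state, count)
```

On a live node, the text passed to the function would be the captured output of "netstat -an" instead of the sample string.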
We are running torque-2.5.12 and we have around 1300 job slots in our cluster.
I would appreciate it if someone could give me some hints.
Thanks and Regards
Kashif