[torqueusers] momctl errors

Kashif Mohammad k.mohammad1 at physics.ox.ac.uk
Tue Dec 18 05:22:39 MST 2012


Hi,

I am troubleshooting a problem on our cluster where jobs on compute nodes sometimes lose contact with the Torque server. When this happens, the jobs keep running on the compute node while the Torque server thinks they have finished.


There are a lot of these errors in the server log file:

PBS_Server: LOG_ERROR::Cannot assign requested address (99) in send_job, send_job failed to a3010517 port 15002 

If I check
momctl -h node_name -d 2

it gives the expected output, but if I check

momctl  -p 15002 -h node_name -d 2

it fails with this error:
ERROR:    query[0] 'diag2' failed on node_name (errno=0-Success: 5-Input/output error)

On the compute node I can see that it is listening on port 15002, but connections coming in to this port stay in the TIME_WAIT state:


netstat -an | grep 15002
tcp        0      0 0.0.0.0:15002               0.0.0.0:*                   LISTEN
tcp        0      0 163.1.5.98:15002            163.1.5.44:861              TIME_WAIT
tcp        0      0 163.1.5.98:15002            163.1.5.44:607              TIME_WAIT
tcp        0      0 163.1.5.98:15002            163.1.5.44:831              TIME_WAIT
tcp        0      0 163.1.5.98:15002            163.1.5.44:999              TIME_WAIT
tcp        0      0 163.1.5.98:15002            163.1.5.44:993              TIME_WAIT
tcp        0      0 163.1.5.98:15002            163.1.5.44:985              TIME_WAIT
tcp        0      0 163.1.5.98:15002            163.1.5.44:685              TIME_WAIT
tcp        0      0 163.1.5.98:15002            163.1.5.44:897              TIME_WAIT
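
To count these connections rather than read the list by eye, something like the following sketch could be used. It assumes the `netstat -an` column layout shown above (local address in column 4, remote address in column 5, state in column 6). The function name `count_time_wait` is only an illustrative choice, not part of Torque.

```shell
#!/bin/sh
# Count TIME_WAIT connections to a local port, grouped by remote host.
# Reads `netstat -an` style output on stdin; the port defaults to 15002.
# Assumption: column 4 = local address, column 5 = remote address,
# column 6 = connection state, as in the output quoted in this message.
count_time_wait() {
    port="${1:-15002}"
    awk -v port="$port" '
        # Keep only TIME_WAIT rows whose local address ends in ":<port>".
        $6 == "TIME_WAIT" && $4 ~ (":" port "$") {
            remote = $5
            sub(/:[0-9]+$/, "", remote)   # strip the remote port number
            count[remote]++
        }
        END {
            # One line per remote host: "<count> <host>"
            for (host in count) print count[host], host
        }
    ' | sort -rn
}

# Usage on a compute node:
#   netstat -an | count_time_wait 15002
```

For the output above this would print a single line giving the number of TIME_WAIT connections from 163.1.5.44.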

We are running torque-2.5.12 and have around 1300 job slots in our cluster.

I would appreciate it if someone could give me some hints.

Thanks and Regards
Kashif


