[torqueusers] problems with server and mom communication
Paul D Marshall
Paul.Marshall at Colorado.EDU
Thu Oct 4 09:39:55 MDT 2012
Torque is failing to mark completed jobs as done on the server; they simply stick in the running state. I can submit jobs fine, they appear to run successfully, and pbs_mom notes their termination and then attempts to notify the server. However, at that point pbs_mom hits this error:
pbs_mom;Svr;pbs_mom;LOG_ERROR::Operation now in progress (115) in scan_for_exiting,
I've run into this problem with Torque 2.5.12, 3.0.6, 4.0.2, and 4.1.2, all with fairly basic, near-default configurations. I've tried increasing the various timeouts as well as mom_job_sync, and none of it seems to help.
I did a bit more digging in 2.5.12, and it appears to fail on the bind in client_to_srv in src/lib/Libnet/net_client.c, claiming that the address is already in use, even though (from what I can tell) it should retry with different ports.
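To make the expected behaviour concrete, here is a minimal Python sketch of the "try the next port when bind reports address already in use" pattern. It is not Torque's code: the function name bind_first_free_port and the candidate port range are hypothetical, and it only illustrates what a retry loop of this kind would normally do.

```python
import errno
import socket


def bind_first_free_port(sock, host, ports):
    """Bind sock to the first port in `ports` that is not already in use.

    Returns the port that was bound. Any error other than EADDRINUSE is
    re-raised immediately; if every candidate is taken, EADDRINUSE is raised.
    """
    for port in ports:
        try:
            sock.bind((host, port))
            return port
        except OSError as exc:
            if exc.errno != errno.EADDRINUSE:
                raise  # e.g. EACCES for a privileged port, or a bad address
    raise OSError(errno.EADDRINUSE, "no free port in candidate range")
```

If a loop like this still ends in "address already in use", either every candidate port really is taken or the loop gives up on the first failure instead of moving on, which is the distinction worth checking in the real source.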
Has anyone else run into this, or have any suggestions? I believe I have hostname resolution set up appropriately, but it's possible something is slightly off (hostname resolution issues are the most I've been able to gather from the internet as to what might be at the root of pbs_server/pbs_mom communication problems).
In this setup, pbs_server and pbs_mom are on different networks, so latency is on the order of tens of milliseconds rather than sub-millisecond.
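One way to sanity-check resolution on each host is to do a forward lookup followed by a reverse lookup and compare the names. The sketch below uses only Python's standard socket module; the function name check_resolution is made up for illustration, and it says nothing about how Torque itself resolves names.

```python
import socket


def check_resolution(hostname):
    """Return a list of (ipv4_address, reverse_name) pairs for hostname.

    reverse_name is None when the reverse lookup fails.
    """
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    addresses = sorted({info[4][0] for info in infos})
    results = []
    for addr in addresses:
        try:
            reverse_name = socket.gethostbyaddr(addr)[0]
        except OSError:
            reverse_name = None  # no PTR record or hosts entry for this address
        results.append((addr, reverse_name))
    return results
```

Running this on both machines, for both the server's and the mom's hostname, should give the same addresses everywhere, and the reverse names should match the names each daemon is configured with.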
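The latency may be relevant to the logged error. On Linux, errno 115 is EINPROGRESS, which a non-blocking connect() returns when the connection cannot be completed immediately; with tens of milliseconds of round-trip time that would be the normal case rather than an exception. Whether this is what Torque is tripping over is only a guess. The following Python sketch (the function name connect_with_timeout is hypothetical) shows the usual way to handle it: wait for the socket to become writable, then read SO_ERROR to get the real result.

```python
import errno
import os
import select
import socket


def connect_with_timeout(host, port, timeout):
    """Non-blocking connect that treats EINPROGRESS as 'still connecting'."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    rc = sock.connect_ex((host, port))
    if rc not in (0, errno.EINPROGRESS):
        sock.close()
        raise OSError(rc, os.strerror(rc))

    # The socket becomes writable once the connect attempt has finished,
    # whether it succeeded or failed.
    _, writable, _ = select.select([], [sock], [], timeout)
    if not writable:
        sock.close()
        raise TimeoutError("connect did not finish within the timeout")

    # SO_ERROR holds the final outcome of the connect attempt.
    err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    if err != 0:
        sock.close()
        raise OSError(err, os.strerror(err))
    return sock
```

Code that treats EINPROGRESS as a hard failure would work on a sub-millisecond LAN only by luck and would be expected to fail consistently across a slower link.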