[torqueusers] Re: Unable to contact nodes
corey at rentec.com
Thu Nov 4 10:11:44 MST 2004
I have found a cause and a solution for the "unable to contact node" and "Address already in use (98) in contact_sched, Could not contact Scheduler - port 15004" errors. It seems that Torque is limited to using ports 512-1024 when contacting the nodes and the scheduler. Once all of these ports are in use, you start to get these errors: jobs are rejected (although not really by a mom, since they never get there), jobs are deferred, and it becomes very troublesome to delete jobs.
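As a rough illustration of this failure mode, here is a minimal C sketch of a client that insists on a local port from a fixed range. It is not Torque's client_to_svr(); the helper name bind_in_range and its parameters are made up for this example. Once every port in the range is taken, bind() fails with EADDRINUSE, which is errno 98 on Linux and matches the "(98)" in the message above.

```c
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper, NOT Torque's client_to_svr(): create a TCP socket
 * and bind it to some local port in [lo, hi], scanning downward.
 * Returns the bound socket, or -1 with errno set: EADDRINUSE when every
 * port in the range is taken, EACCES when a port below 1024 is requested
 * without root privileges.
 */
int bind_in_range(int lo, int hi)
{
    if (lo > hi) {
        errno = EINVAL;
        return -1;
    }

    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0)
        return -1;

    for (int port = hi; port >= lo; port--) {
        struct sockaddr_in local;
        memset(&local, 0, sizeof(local));
        local.sin_family = AF_INET;
        local.sin_addr.s_addr = htonl(INADDR_ANY);
        local.sin_port = htons((unsigned short)port);

        if (bind(sock, (struct sockaddr *)&local, sizeof(local)) == 0)
            return sock;          /* found a free port in the range */
        if (errno != EADDRINUSE)
            break;                /* e.g. EACCES: retrying will not help */
    }

    /* Preserve the bind() error across close(). */
    int saved = errno;
    close(sock);
    errno = saved;
    return -1;
}
```

Calling bind_in_range(512, 1023) as root on a host where that whole range is occupied would return -1 with errno set to EADDRINUSE. Letting the kernel pick the port instead (no bind() before connect()) avoids the fixed range altogether.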
To reproduce this error on a server with two clients, I ran a script that grabs most of the ports between 512 and 1024:
for p in $(seq 520 1024); do
    netcat -l -p $p &
done
I did this to simulate a busy server, since I could not do this in production. As expected, after submitting a few jobs the server starts to complain about not being able to contact the scheduler and nodes, exactly like our production server when it is busy.
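To judge how close a busy Linux host is to this state, one option is to count the socket entries in /proc/net/tcp whose local port falls in the range. The sketch below assumes the Linux /proc/net/tcp line format (slot number, then hex local address and port). The function name count_local_ports is made up for this example, and it counts entries rather than distinct ports, so a port with several connections is counted several times.

```c
#include <stdio.h>

/*
 * Illustrative helper: count entries whose local port lies in [lo, hi],
 * reading a stream in the Linux /proc/net/tcp format. A socket line looks
 * like "   0: 0100007F:0277 00000000:0000 0A ...", where the hex digits
 * after the local address are the local port (0x0277 is 631).
 */
int count_local_ports(FILE *in, unsigned lo, unsigned hi)
{
    char line[512];
    int count = 0;

    while (fgets(line, sizeof(line), in) != NULL) {
        unsigned port;
        /* Skip the slot number and local address, capture the local port.
           The column header line does not match and is ignored. */
        if (sscanf(line, " %*d: %*[0-9A-Fa-f]:%x", &port) == 1 &&
            port >= lo && port <= hi)
            count++;
    }
    return count;
}
```

A caller would open /proc/net/tcp with fopen() and pass the stream in with lo = 512 and hi = 1023. A count near the size of the range suggests the server is about to hit the errors described here.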
I made this change and the scheduler errors disappeared:
--- run_sched.c.orig 2004-11-04 10:58:01.000000000 -0500
+++ run_sched.c 2004-11-04 10:58:21.000000000 -0500
@@ -152,7 +152,7 @@
- sock = client_to_svr(pbs_scheduler_addr,pbs_scheduler_port,1);
+ sock = client_to_svr(pbs_scheduler_addr,pbs_scheduler_port,0);
if (sock < 0)
This change made the "unable to contact node" errors disappear:
--- svr_connect.c.orig 2004-11-04 10:58:15.000000000 -0500
+++ svr_connect.c 2004-11-04 10:58:36.000000000 -0500
@@ -147,7 +147,7 @@
- sock = client_to_svr(hostaddr,port,1);
+ sock = client_to_svr(hostaddr,port,0);
if (sock < 0)
What is the reason for limiting the ports to 512-1024? Is there a problem with making these changes and allowing the use of ports higher than 1024? Are there people with large clusters and heavy job submission who are not experiencing these problems?