[torqueusers] Torque reject error

Corey Hirschman corey at rentec.com
Mon Nov 8 09:18:42 MST 2004


Hello,

We are having the same problem as you and found that the cause is that the server is running out of ports.  Torque is set up to use ports 512-1024, and during busy times, if you run netstat -na, you will see all of these ports tied up in TIME_WAIT.  Once this happens you start to get 15004 and 15001 errors (cannot connect to scheduler and MOM reject).
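
To check whether a server is in this state, a rough sketch like the one below counts TIME_WAIT sockets in the port range from saved netstat -na output.  The function name and sample text are made up for illustration, the default range is simply the 512-1024 figure quoted above, and the column layout is assumed to be the usual "proto recv-q send-q local foreign state" form, which can differ between platforms.

```python
import re

def count_time_wait(netstat_output, low=512, high=1024):
    """Count TCP sockets in TIME_WAIT whose local port lies in [low, high].

    Assumes `netstat -na` style lines:
        proto recv-q send-q local-address foreign-address state
    """
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, UDP lines and anything without a state column.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[5] != "TIME_WAIT":
            continue
        # The local address ends in ":port" (Linux) or ".port" (BSD style).
        match = re.search(r"[:.](\d+)$", fields[3])
        if match and low <= int(match.group(1)) <= high:
            count += 1
    return count

# Hypothetical sample output, not taken from a real server.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address   Foreign Address   State
tcp   0      0      10.0.0.1:1001   10.0.0.2:15002    TIME_WAIT
tcp   0      0      10.0.0.1:600    10.0.0.3:15002    TIME_WAIT
tcp   0      0      10.0.0.1:40000  10.0.0.3:15002    TIME_WAIT
tcp   0      0      10.0.0.1:700    10.0.0.3:15002    ESTABLISHED
udp   0      0      0.0.0.0:111     0.0.0.0:*
"""
```

Calling count_time_wait(SAMPLE) counts only the first two connections.  If the count on a live server approaches the size of the range during busy periods, that would be consistent with the port exhaustion described here.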

I changed the code in run_sched.c and svr_connect.c to use any port, not just privileged ones, and this fixed the errors (see http://www.supercluster.org/pipermail/torqueusers/2004-November/000928.html for the exact changes).  This is probably not the best solution, but it was just a test to see if it would fix the problem.  PBS Pro also uses only privileged ports and does not exhibit this problem, so we are now examining how it communicates with the nodes and scheduler so that a fix can be implemented in Torque.
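
For anyone unfamiliar with the distinction, here is a minimal Python sketch of the two behaviours.  It is not the actual change to run_sched.c or svr_connect.c (those are C, and the real diff is at the URL above); the function name and parameters are hypothetical.  With no port list, the kernel picks any free source port; with a list, the client has to bind each candidate in turn and fails once they are all busy.

```python
import errno
import socket

def connect_with_local_port(host, port, local_ports=None, timeout=5.0):
    """Open a TCP connection, optionally restricting the local source port.

    local_ports=None  -> let the kernel choose any free source port.
    local_ports=range -> try each port in turn (IPv4 only in this sketch).
    """
    if local_ports is None:
        # Unrestricted case: the kernel assigns a source port on connect.
        return socket.create_connection((host, port), timeout=timeout)

    last_error = None
    for local_port in local_ports:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        try:
            sock.bind(("", local_port))   # needs root for ports below 1024
            sock.connect((host, port))
            return sock
        except OSError as exc:
            sock.close()
            # Only move on when the port itself is unavailable.
            if exc.errno not in (errno.EADDRINUSE, errno.EADDRNOTAVAIL):
                raise
            last_error = exc
    raise OSError(errno.EADDRINUSE,
                  "no usable local port in the requested range") from last_error
```

The restricted case has only a few hundred candidate ports to draw from, so it can run dry on a busy server, while the unrestricted case draws on the kernel's much larger ephemeral range.  The trade-off is that the receiving side can no longer treat a privileged source port as evidence that the connection came from a root-owned process.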

Corey

On Sun, Nov 07, 2004 at 06:29:09PM -0800, Mr Tony Ling wrote:
> Hi,
> 
>   I am using torque-1.1.0p3, running on 128 nodes
> with dual CPUs. When there are a lot of running and queued
> jobs, the error "pbs_mom;Req;req_reject;Reject reply
> code=15004(Invalid request), aux=0, type=3, from
> PBS_Server at server" occurs on the node running
> pbs_mom, and jobs begin to be rejected from that node. On the
> server running pbs_server, I get the error "unable to
> run job, MOM rejected".
>    Can anyone help me figure out what the cause is?
> 
>     Thanks.

