[torqueusers] Torque reject error

Mr Tony Ling tonylsp at yahoo.com
Mon Nov 8 19:25:33 MST 2004


Hi Corey,

   Thanks for your reply. I will try it out at my
site.
By the way, I found an alternative workaround: find
the node that rejected the job and restart pbs_mom on
that node. After that the job can run on that node, so
it seems the problem is in pbs_mom. Maybe it is because
a restarted pbs_mom uses a different privileged port to
connect to the server, one that is not being used by
any other pbs_mom?
    Anyway, thanks a lot for your help.
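
For anyone who wants to check whether their server is in the
state Corey describes in the quoted message below, here is a
rough sketch in Python that counts TIME_WAIT sockets whose local
port falls in the 512-1024 range, given the text output of
netstat -na. The function name count_time_wait and the column
layout (Linux-style netstat, local address in the fourth column
written as addr:port) are assumptions for illustration, not
anything taken from Torque.

```python
def count_time_wait(netstat_text, low=512, high=1024):
    """Count TCP sockets in TIME_WAIT whose local port is in [low, high].

    Assumes Linux-style `netstat -na` lines of the form:
        proto recv-q send-q local_addr:port remote_addr:port state
    Other platforms format the address differently and would need
    a different split.
    """
    count = 0
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip headers, UDP, UNIX-domain sockets and short lines.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[5] != "TIME_WAIT":
            continue
        # The local port is whatever follows the last ':' in column 4.
        port_text = fields[3].rsplit(":", 1)[-1]
        if port_text.isdigit() and low <= int(port_text) <= high:
            count += 1
    return count
```

Feeding it the output of netstat -na captured during a busy
period gives the number of ports in the range that are tied up;
a count approaching the size of the range would be consistent
with the ports being exhausted.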



--- Corey Hirschman <corey at rentec.com> wrote:

> Hello,
> 
> We are having the same problems as you and found
> that the problem is that the server is running out
> of ports.  Torque is set up to use ports 512-1024
> and during busy times if you do a netstat -na you
> will see all these ports get tied up in TIME_WAIT. 
> Once this happens you start to get 15004 and 15001
> (cannot connect to scheduler and mom reject) errors.
> 
> I changed the code in run_sched.c and svr_connect.c
> to use any port, not just privileged ones, and this
> fixed the errors (see
> http://www.supercluster.org/pipermail/torqueusers/2004-November/000928.html
> if you want to see the exact changes).  This is
> probably not the best solution, but it was just a
> test to see if it would fix the problem.  PBS Pro
> only uses privileged ports also and it does not
> exhibit this problem, so we are in the process now
> of examining how it communicates with the nodes and
> scheduler so a fix can be implemented in Torque. 
> 
> Corey
> 
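
To illustrate the difference between the two approaches
mentioned above (restricting outgoing connections to a fixed
port range versus letting the operating system pick any free
port), here is a minimal Python sketch. It is not the Torque
code from run_sched.c or svr_connect.c; bind_in_range and
bind_any are hypothetical helpers written only for this
example, and binding to ports below 1024 normally requires
root.

```python
import socket


def bind_in_range(low, high, host=""):
    """Try to bind a TCP socket to a local port in [low, high].

    Ports are tried from the highest down. Returns the bound socket,
    or None if no port in the range could be bound (already in use,
    or not permitted for this user).
    """
    for port in range(high, low - 1, -1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            return sock
        except OSError:
            # Port unavailable: release the socket and try the next one.
            sock.close()
    return None


def bind_any(host=""):
    """Bind a TCP socket to whatever free port the OS chooses (port 0)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))
    return sock
```

With a restricted range the caller has to handle the None case
when every port is taken, which is roughly the situation that
shows up as rejects on a busy server. With bind_any the pool is
much larger, at the cost of no longer connecting from a
privileged port.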
> On Sun, Nov 07, 2004 at 06:29:09PM -0800, Mr Tony Ling wrote:
> > Hi,
> > 
> >   I am using torque-1.1.0p3, running on 128 nodes
> > with dual CPUs. When there are a lot of running and
> > queued jobs, the error "pbs_mom;Req;req_reject;Reject
> > reply code=15004(Invalid request), aux=0, type=3, from
> > PBS_Server at server" occurs on the node running
> > pbs_mom, and jobs begin to be rejected from that node.
> > On the server running pbs_server, I get the error
> > "unable to run job, MOM rejected".
> >    Can anyone help me figure out what the cause is?
> > 
> >     Thanks.
> > 
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://supercluster.org/mailman/listinfo/torqueusers