[torqueusers] Torque reject error
Mr Tony Ling
tonylsp at yahoo.com
Mon Nov 8 19:25:33 MST 2004
Hi Corey,
Thanks for your reply. I will try it out at my
site.
By the way, I found an alternative solution: find the
node that rejected the job and restart pbs_mom on that
node. The job can then run on that node, so it seems
the problem is in pbs_mom. Maybe it is because, when
pbs_mom restarts, it uses a different privileged port
to connect to the server, one that is not being used
by another pbs_mom?
Anyway, thanks a lot for your help.
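
One way to check for the port exhaustion described in the quoted
message below is to count how many low-numbered local ports are
sitting in TIME_WAIT in the output of netstat -na. The Python sketch
below is only an illustration, not part of Torque. It assumes the
Linux netstat column layout, the function name count_time_wait is
hypothetical, the sample text is made up, and the upper bound of 1023
is an assumption (the quoted message says 512-1024, while privileged
ports are those below 1024).

```python
import re

# Assumption: the reserved range of interest. The quoted message says
# "512-1024"; privileged ports are those below 1024, so 1023 is used here.
LOW_PORT = 512
HIGH_PORT = 1023


def count_time_wait(netstat_text, low=LOW_PORT, high=HIGH_PORT):
    """Count TCP sockets in TIME_WAIT whose local port is in [low, high]."""
    count = 0
    for line in netstat_text.splitlines():
        fields = line.split()
        # Assumed Linux layout: proto recv-q send-q local foreign state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[5] != "TIME_WAIT":
            continue
        # The local address ends in ":port" (Linux) or ".port" (some BSDs).
        match = re.search(r"[:.](\d+)$", fields[3])
        if match and low <= int(match.group(1)) <= high:
            count += 1
    return count


# Made-up sample in the style of "netstat -na" output, for illustration only.
SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.1:1023           10.0.0.17:15002         TIME_WAIT
tcp        0      0 10.0.0.1:600            10.0.0.18:15002         TIME_WAIT
tcp        0      0 10.0.0.1:40000          10.0.0.19:15002         TIME_WAIT
tcp        0      0 10.0.0.1:15001          10.0.0.20:1000          ESTABLISHED
udp        0      0 0.0.0.0:1001            0.0.0.0:*
"""

if __name__ == "__main__":
    print(count_time_wait(SAMPLE))
```

To run it against a live server, the text could come from
subprocess.run(["netstat", "-na"], capture_output=True, text=True).stdout
instead of SAMPLE. A count close to the size of the range during busy
periods would be consistent with the server running out of ports.
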
--- Corey Hirschman <corey at rentec.com> wrote:
> Hello,
>
> We are having the same problem as you and found
> that the cause is the server running out of ports.
> Torque is set up to use ports 512-1024, and during
> busy times a netstat -na will show all of these
> ports tied up in TIME_WAIT. Once this happens you
> start to get 15004 and 15001 (cannot connect to
> scheduler and mom reject) errors.
>
> I changed the code in run_sched.c and svr_connect.c
> to use any port, not just privileged ones, and this
> fixed the errors (see
> http://www.supercluster.org/pipermail/torqueusers/2004-November/000928.html
> for the exact changes). This is probably not the
> best solution; it was just a test to see whether it
> would fix the problem. PBS Pro also uses only
> privileged ports and does not exhibit this problem,
> so we are now examining how it communicates with the
> nodes and scheduler so that a fix can be implemented
> in Torque.
>
> Corey
>
> On Sun, Nov 07, 2004 at 06:29:09PM -0800, Mr Tony Ling wrote:
> > Hi,
> >
> > I am using torque-1.1.0p3, running on 128 nodes
> > with dual CPUs. When there are a lot of running and
> > queued jobs, the error
> > "pbs_mom;Req;req_reject;Reject reply
> > code=15004(Invalid request), aux=0, type=3, from
> > PBS_Server at server"
> > occurs on the node running pbs_mom, and jobs begin
> > to be rejected from that node. On the server running
> > pbs_server, I get the error "unable to run job, MOM
> > rejected".
> > Can anyone help me figure out the cause?
> >
> > Thanks.
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://supercluster.org/mailman/listinfo/torqueusers