[torqueusers] Bad file descriptor (9) in req_jobscript,
job in unexpected state
Davide.Salomoni at nikhef.nl
Mon Dec 6 08:01:49 MST 2004
Thanks for your reply. I see that the failure keeps reappearing, but I am
not able to reproduce it at will: it just happens from time to time (e.g.
this weekend) and requires us to restart the MOM on the affected nodes. How
can I help you track this down? (Logfiles, raising the debug level, etc.)
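Since the failure is intermittent, one way to gather evidence is to count how
often the rejection appears in each MOM log. The following is a minimal sketch,
assuming the semicolon-separated log layout shown in the quoted messages below
(timestamp, event code, daemon, class, object, message). The function names and
the hourly bucketing are illustrative choices, not part of TORQUE.

```python
import sys
from collections import Counter
from datetime import datetime

# Substrings taken from the log message quoted in this thread.
MARKERS = ("req_jobscript", "unexpected state")


def parse_line(line):
    """Return (timestamp, message) for one MOM log line, or None if it does not parse."""
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    try:
        stamp = datetime.strptime(parts[0].strip(), "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return stamp, parts[5].strip()


def count_failures_by_hour(lines):
    """Count req_jobscript rejections per hour across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        stamp, message = parsed
        if all(marker in message for marker in MARKERS):
            counts[stamp.strftime("%Y-%m-%d %H:00")] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python count_rejects.py <mom log file>
    with open(sys.argv[1]) as handle:
        for hour, n in sorted(count_failures_by_hour(handle).items()):
            print(hour, n)
```

Running this against the logs of an affected node and a healthy node, and
comparing the hours in which the counts start, may help narrow down the
original failure that left the job behind.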
> -----Original Message-----
> From: Dave Jackson [mailto:jacksond at supercluster.org]
> Sent: Thursday, December 02, 2004 19:09
> To: Davide Salomoni
> Cc: torqueusers at supercluster.org
> Subject: Re: [torqueusers] Bad file descriptor (9) in req_jobscript, job
> in unexpected state
> The failure you are seeing results from the pbs_server daemon
> attempting to stage a job script to the PBS MOM and finding the job in a
> non-idle state. Most likely, your job was originally submitted to the
> failing MOM and for some reason failed to start. When it failed,
> pbs_server assumed the job had been cleared out of the MOM and, at a
> later point, resubmitted the job. However, the MOM now rejects the
> subsequent stage request, declaring that the job is already staged.
> pbs_server fails and tries again, and again.
> There are two possible solutions. First, if the MOM detects that the
> server is attempting to stage a job and the job is not executing, the MOM
> should probably purge its local job and fail the request, allowing the
> next stage request to succeed. However, the most correct solution is to
> determine the original failure which started this issue and make certain
> the local MOM job is purged at that point.
> If you can reproduce this failure somewhat reliably, we would be happy
> to assist you in implementing and testing these fixes.
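The first fix proposed above can be illustrated with a toy model. This is only
a sketch of the decision logic, not TORQUE code: `ToyMom`, `JobState`, and
`StageRejected` are hypothetical names, and the method is called
`req_jobscript` only to echo the function named in the log message.

```python
from enum import Enum


class JobState(Enum):
    STAGED = "staged"    # script received, job not executing
    RUNNING = "running"  # job is executing


class StageRejected(Exception):
    """Raised when the toy MOM refuses a job-script stage request."""


class ToyMom:
    """Hypothetical stand-in for a MOM's local job table."""

    def __init__(self):
        self.jobs = {}

    def req_jobscript(self, job_id):
        state = self.jobs.get(job_id)
        if state is None:
            # Normal case: no local copy, so accept the staged script.
            self.jobs[job_id] = JobState.STAGED
            return
        if state is JobState.RUNNING:
            # A running job must not be disturbed; keep rejecting.
            raise StageRejected(f"{job_id} is executing")
        # Stale, non-executing copy: purge it and fail this request,
        # so that the server's next stage attempt can succeed.
        del self.jobs[job_id]
        raise StageRejected(f"{job_id} was stale and has been purged")
```

In this model a stale job costs the server exactly one failed stage request
instead of an endless retry loop, while a job that is really running is never
purged.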
> On Wed, 2004-12-01 at 02:22, Davide Salomoni wrote:
> > Hello,
> > after the upgrade to torque 1.1.0p4, some of the nodes of my farm show
> > the following messages:
> > 12/01/2004 10:09:27;0001; pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in
> > req_jobscript, job in unexpected state
> > 12/01/2004 10:09:27;0080; pbs_mom;Req;req_reject;Reject reply
> > code=15004(Invalid request), aux=0, type=3, from
> > PBS_Server at tbn18.nikhef.nl
> > Why am I getting these messages?
> > Apparently, the MOM process on those nodes does not work anymore. I
> > tried first of all to cycle the MOM using the new momctl command from
> > the server, as in
> > [root at tbn18 root]# ./momctl -C -h node15-9.farmnet.nikhef.nl
> > mom node15-9.farmnet.nikhef.nl successfully cycled cycle forced
> > which results in the following message on the node:
> > 12/01/2004 10:14:23;0002; pbs_mom;n/a;rm_request;reporting cycle
> > but this does not solve the problem. I thought momctl would trigger a
> > mom restart, and it doesn't. Is that right?
> > But if I manually restart MOM *on the node*, as in
> > [root at node15-9 root]# service pbs restart
> > the problem is gone. Could you help me understand what's going on?
> > Thanks,
> > Davide