[torqueusers] Bad file descriptor (9) in req_jobscript, job in unexpected state

Davide Salomoni Davide.Salomoni at nikhef.nl
Mon Dec 6 08:01:49 MST 2004


Dave,

thanks for your reply. The failure keeps recurring, but I am not able to
reproduce it at will: it just happens from time to time (e.g. this weekend),
and each time we have to restart mom on the affected nodes. How can I help
you track this down? (Log files, a higher debug level, etc.)
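
One way to gather that evidence is to scan each node's mom log for the
error signature and note when it first appears. The following is only a
minimal Python sketch, not part of TORQUE: the function name and the
SIGNATURE constant are made up for illustration, and it assumes the six
semicolon-separated fields seen in the log excerpts quoted further down.

```python
from datetime import datetime

# Message fragment taken from the quoted pbs_mom log line.
SIGNATURE = "req_jobscript, job in unexpected state"


def find_jobscript_failures(lines):
    """Return the timestamps of log records whose message contains SIGNATURE.

    Assumed record layout (from the excerpts in this thread):
    timestamp;event code;daemon;class;object;message
    """
    hits = []
    for line in lines:
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) != 6:
            continue  # not a record in the assumed layout
        stamp, _code, _daemon, _cls, _obj, message = fields
        if SIGNATURE not in message:
            continue
        try:
            hits.append(datetime.strptime(stamp.strip(), "%m/%d/%Y %H:%M:%S"))
        except ValueError:
            continue  # unexpected timestamp format, skip the record
    return hits
```

Feeding it the lines of one mom log file (for example via open(path), with
the path depending on the local installation) gives a list of datetime
values; comparing the earliest one per node with the server log around the
same time may point at the original failed start.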

Davide

> -----Original Message-----
> From: Dave Jackson [mailto:jacksond at supercluster.org]
> Sent: Thursday, December 02, 2004 19:09
> To: Davide Salomoni
> Cc: torqueusers at supercluster.org
> Subject: Re: [torqueusers] Bad file descriptor (9) in req_jobscript, job
> in unexpected state
> 
> Davide,
> 
>   The failure you are seeing results from the pbs_server daemon
> attempting to stage a job script to the PBS mom and finding the job in a
> non-idle state.  Most likely, your job was originally submitted to the
> failing MOM and for some reason failed.  When it failed to start,
> pbs_server assumed the job was cleared out of the MOM and, at a later
> point, resubmitted the job.  However, the mom is now rejecting the
> subsequent stage activity, declaring that the job is already staged.
> pbs_server fails and tries again, and again.
> 
>   Two possible solutions.  First, if the MOM detects that the server is
> attempting to stage a job and the job is not executing, the MOM should
> probably purge its local job and fail, allowing the next stage request to
> succeed.  However, the more correct solution is to determine the
> original failure which started this issue and make certain the local MOM
> job is purged at that point.
> 
>   If you can reproduce this failure somewhat reliably, we would be happy
> to assist you in implementing and testing these fixes.
> 
> Thanks,
> Dave
> 
> On Wed, 2004-12-01 at 02:22, Davide Salomoni wrote:
> > Hello,
> >
> > after the upgrade to torque 1.1.0p4, some of the nodes of my farm generate
> > the following messages:
> >
> > 12/01/2004 10:09:27;0001;   pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in req_jobscript, job in unexpected state
> > 12/01/2004 10:09:27;0080;   pbs_mom;Req;req_reject;Reject reply code=15004(Invalid request), aux=0, type=3, from PBS_Server at tbn18.nikhef.nl
> >
> >
> > Why am I getting these messages?
> >
> > Apparently, the MOM process on those nodes does not work anymore. I tried
> > first of all to cycle the MOM using the new momctl command from the server,
> > as in
> >
> > [root at tbn18 root]# ./momctl -C -h node15-9.farmnet.nikhef.nl
> > mom node15-9.farmnet.nikhef.nl successfully cycled cycle forced
> >
> > which results in the following message on the node:
> >
> > 12/01/2004 10:14:23;0002;   pbs_mom;n/a;rm_request;reporting cycle forced
> >
> > but this does not solve the problem. I thought momctl would trigger a full
> > mom restart, and it doesn't. Is that right?
> >
> > But if I manually restart MOM *on the node*, as in
> >
> > [root at node15-9 root]# service pbs restart
> >
> > the problem is gone. Could you help me understand what's going on?
> >
> > Thanks,
> > Davide
> >
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://supercluster.org/mailman/listinfo/torqueusers
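
The first fix Dave suggests above (purge the stale local job, fail once,
and let the next stage request succeed) can be illustrated with a toy
model. This is a sketch in Python under stated assumptions, not TORQUE
code: the real logic lives in the C function req_jobscript inside pbs_mom,
and every name below (ToyMom, JobState, StageRejected, purge_stale) is
hypothetical.

```python
from enum import Enum


class JobState(Enum):
    STAGED = "staged"    # script received earlier, job never started
    RUNNING = "running"  # job is actually executing


class StageRejected(Exception):
    """Stands in for the 'Invalid request' reject reply sent to the server."""


class ToyMom:
    """Toy stand-in for a mom's local job table (hypothetical model)."""

    def __init__(self, purge_stale):
        self.purge_stale = purge_stale
        self.jobs = {}  # job id -> JobState

    def stage(self, job_id):
        state = self.jobs.get(job_id)
        if state is None:
            self.jobs[job_id] = JobState.STAGED  # normal case: accept script
            return
        if state is JobState.RUNNING:
            raise StageRejected("job is executing")
        # A leftover, non-executing copy from an earlier failed start.
        if self.purge_stale:
            del self.jobs[job_id]  # proposed fix: drop the stale copy...
        raise StageRejected("job in unexpected state")  # ...and fail this once


def attempts_until_staged(mom, job_id, limit=5):
    """Retry like the server does; return the attempt number that succeeded."""
    for attempt in range(1, limit + 1):
        try:
            mom.stage(job_id)
            return attempt
        except StageRejected:
            continue
    return None  # never accepted within the limit
```

With a stale STAGED entry in place, a ToyMom built with purge_stale=False
rejects every retry, which matches the repeated failures described in the
thread, while purge_stale=True rejects the first request and accepts the
second. The model says nothing about why the original start failed, which
is the part Dave considers the more correct place to fix.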


