[torqueusers] Bad file descriptor (9) in req_jobscript, job in unexpected state
jacksond at supercluster.org
Thu Dec 2 11:08:53 MST 2004
The failure you are seeing results from the pbs_server daemon
attempting to stage a job script to the PBS MOM and finding the job in a
non-idle state. Most likely, your job was originally submitted to the
failing MOM and failed for some reason. When it failed to start,
pbs_server assumed the job had been cleared out of the MOM and, at a
later point, resubmitted it. However, the MOM is now rejecting the
subsequent stage request, declaring that the job is already staged, so
pbs_server fails and retries again and again.
There are two possible solutions. First, if the MOM detects that the
server is attempting to stage a job that is not executing, the MOM
should probably purge its local copy of the job and fail, allowing the
next stage request to succeed. However, the more correct solution is to
determine the original failure that started this cycle and make certain
the local MOM job is purged at that point.
If you can reproduce this failure somewhat reliably, we would be happy
to assist you in implementing and testing these fixes.
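In the meantime, a manual workaround is to clear the stale job state on
the node and restart the MOM. The sketch below is an assumption-laden
illustration, not official procedure: it assumes a default TORQUE spool
directory of /var/spool/torque and that the MOM keeps per-job state files
named after the job id under mom_priv/jobs; the function name
purge_stale_mom_job is hypothetical. Verify the paths on your
installation before running anything like this.

```shell
#!/bin/sh
# Hypothetical cleanup sketch for a MOM stuck rejecting stage-in requests.
# Assumptions: spool at /var/spool/torque (override with PBS_SPOOL), and
# per-job state files stored as <jobid>.* under mom_priv/jobs.
purge_stale_mom_job() {
    spool="${PBS_SPOOL:-/var/spool/torque}"
    jobid="$1"
    # Remove the stale state files the MOM still holds for this job, so a
    # subsequent stage request from pbs_server can succeed.
    rm -f "$spool/mom_priv/jobs/$jobid".*
    # On a real node you would then restart the MOM so it rereads its
    # (now clean) job directory, e.g.:
    #   service pbs restart
}
```

Restarting the MOM afterward matches the observation below that a full
`service pbs restart` on the node clears the problem, while `momctl -C`
(which only forces a status cycle) does not.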
On Wed, 2004-12-01 at 02:22, Davide Salomoni wrote:
> after the upgrade to torque 1.1.0p4, some of the nodes of my farm generate
> the following messages:
> 12/01/2004 10:09:27;0001; pbs_mom;Svr;pbs_mom;Bad file descriptor (9) in
> req_jobscript, job in unexpected state
> 12/01/2004 10:09:27;0080; pbs_mom;Req;req_reject;Reject reply
> code=15004(Invalid request), aux=0, type=3, from PBS_Server at tbn18.nikhef.nl
> Why am I getting these messages?
> Apparently, the MOM process on those nodes does not work anymore. I tried
> first of all to cycle the MOM using the new momctl command from the server,
> as in
> [root at tbn18 root]# ./momctl -C -h node15-9.farmnet.nikhef.nl
> mom node15-9.farmnet.nikhef.nl successfully cycled cycle forced
> which results in the following message on the node:
> 12/01/2004 10:14:23;0002; pbs_mom;n/a;rm_request;reporting cycle forced
> but this does not solve the problem. I thought momctl would trigger a full
> mom restart, and it doesn't. Is that right?
> But if I manually restart MOM *on the node*, as in
> [root at node15-9 root]# service pbs restart
> the problem is gone. Could you help me understand what's going on?