[torqueusers] Bug? in torque 2.0.0p2

Åke Sandgren ake.sandgren at hpc2n.umu.se
Sat Dec 17 03:29:09 MST 2005


On Fri, 2005-12-16 at 17:37 -0800, Garrick Staples wrote:
> On Fri, Dec 16, 2005 at 08:11:34PM +0100, Åke Sandgren alleged:
> > Hi!
> > 
> > I have a job that is constantly getting the following in the mom_logs
> > no matter which node it ends up on. It also generates a corefile.
> > 
> > mom_log:
> > 12/16/2005 19:44:28;0001;   pbs_mom;Job;200045.ingrid-h.hpc2n.umu.se;phase 2 of job launch successfully completed
> > 12/16/2005 19:44:28;0001;   pbs_mom;Svr;pbs_mom;No such file or directory (2) in TMomFinalizeJob3, read of pipe for sid failed for job 200045.ingrid-h.hpc2n.umu.se (0 of 8 bytes)
> > 12/16/2005 19:44:28;0001;   pbs_mom;Job;TMomFinalizeJob3;start failed, improper sid
> > 12/16/2005 19:44:28;0008;   pbs_mom;Req;send_sisters;sending command ABORT_JOB for job 200045.ingrid-h.hpc2n.umu.se (10)
> 
> This means the child process that will eventually become the job has
> died.  Unfortunately, this is really really hard to debug.
> 
> Have you tried 2.0.0p4?  We're always fixing bugs and have improved
> logging in this area.  Be sure you configure with --enable-syslog.
> 
> One unfixed bug that can cause this is with the job's environment
> variables.  Vars whose values contain newlines or commas can break things.

It's still there in 2.0.0p4, although this time the error message says:

Bad file descriptor (9) in TMomFinalizeJob3, read of pipe for sid failed
for job 200045.ingrid-h.hpc2n.umu.se (0 of 8 bytes)
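
For anyone unfamiliar with the message: "read of pipe for sid failed ...
(0 of 8 bytes)" fits Garrick's explanation. If the child dies before it
reports its session id, the parent's read hits end of file and gets 0
bytes. The Python sketch below only illustrates that pattern. It is not
TORQUE's code; the function name launch(), the 8-byte integer format and
the child_dies_early flag are all assumptions made for the example.

```python
import os
import struct

# Assumption: the sid travels as one 8-byte integer, chosen only to mirror
# the "0 of 8 bytes" text in the log. This is not taken from pbs_mom.
SID_FORMAT = "q"
SID_SIZE = struct.calcsize(SID_FORMAT)


def launch(child_dies_early):
    """Fork a child that should report its session id over a pipe.

    Returns (bytes_read, sid_or_None) as seen by the parent.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: start a new session and send its id back to the parent.
        try:
            os.close(read_fd)
            if not child_dies_early:
                os.setsid()
                os.write(write_fd, struct.pack(SID_FORMAT, os.getsid(0)))
        finally:
            # Always leave through _exit so the child never runs parent code.
            os._exit(1 if child_dies_early else 0)

    # Parent: close the write end so a dead child shows up as end of file.
    os.close(write_fd)
    data = os.read(read_fd, SID_SIZE)
    os.close(read_fd)
    os.waitpid(pid, 0)
    if len(data) != SID_SIZE:
        return len(data), None  # the "0 of 8 bytes" situation
    return len(data), struct.unpack(SID_FORMAT, data)[0]


if __name__ == "__main__":
    print("healthy child:", launch(child_dies_early=False))
    print("dead child:   ", launch(child_dies_early=True))
```

A read that returns 0 bytes is an end of file and does not set errno, so
the change from "No such file or directory (2)" to "Bad file descriptor
(9)" may just be a stale errno value. That is an inference, not something
checked against the pbs_mom source.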

I'll try to find this next week...
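
In the meantime, the environment-variable theory quoted above is quick to
check. The script below is a minimal sketch that lists variables whose
values contain a newline or a comma. The helper name find_suspect_vars()
is made up for this example, and it assumes the job inherits the
environment of the shell the script is run in, which depends on how the
job is submitted.

```python
import os

# Characters that the quoted reply says can break job launch.
SUSPECT_CHARS = {"\n": "newline", ",": "comma"}


def find_suspect_vars(env):
    """Return {name: [reasons]} for values containing a newline or comma."""
    suspects = {}
    for name, value in env.items():
        reasons = [label for ch, label in SUSPECT_CHARS.items() if ch in value]
        if reasons:
            suspects[name] = reasons
    return suspects


if __name__ == "__main__":
    # Report on the current environment, one variable per line.
    for name, reasons in sorted(find_suspect_vars(os.environ).items()):
        print(f"{name}: contains {', '.join(reasons)}")
```

Running it in the submission shell is most relevant when the whole
environment is exported to the job (for example with qsub -V). Unsetting
any variable it reports and resubmitting would show whether that variable
is involved.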


