[torqueusers] Bug? in torque 2.0.0p2

Garrick Staples garrick at usc.edu
Fri Dec 16 18:37:03 MST 2005


On Fri, Dec 16, 2005 at 08:11:34PM +0100, Åke Sandgren alleged:
> Hi!
> 
> I have a job that is constantly getting the following in the mom_logs
> no matter which node it ends up on. It also generates a corefile.
> 
> mom_log:
> 12/16/2005 19:44:28;0001;   pbs_mom;Job;200045.ingrid-h.hpc2n.umu.se;phase 2 of job launch successfully completed
> 12/16/2005 19:44:28;0001;   pbs_mom;Svr;pbs_mom;No such file or directory (2) in TMomFinalizeJob3, read of pipe for sid failed for job 200045.ingrid-h.hpc2n.umu.se (0 of 8 bytes)
> 12/16/2005 19:44:28;0001;   pbs_mom;Job;TMomFinalizeJob3;start failed, improper sid
> 12/16/2005 19:44:28;0008;   pbs_mom;Req;send_sisters;sending command ABORT_JOB for job 200045.ingrid-h.hpc2n.umu.se (10)

This means the child process that will eventually become the job died
before pbs_mom could read its session ID (sid) from the pipe.
Unfortunately, this is really hard to debug.
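
To illustrate the failure mode, here is a minimal sketch of the general
pattern the log message suggests: a parent forks a child and waits to read
the child's session ID over a pipe. This is not pbs_mom's actual code. The
function name, the 8-byte encoding, and the early-exit switch are all
assumptions made for illustration. If the child exits before writing, the
parent's read returns 0 bytes, which matches the "0 of 8 bytes" in the log.

```python
import os
import struct

# Assumption: an 8-byte integer, chosen only to mirror "0 of 8 bytes" in the log.
SID_FMT = "q"
SID_SIZE = struct.calcsize(SID_FMT)


def launch_and_read_sid(child_dies_early):
    """Fork a child and read the session ID it reports over a pipe.

    Returns (sid, bytes_read); sid is None if the child wrote nothing.
    Hypothetical helper, POSIX only.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: stands in for the process that would become the job.
        try:
            os.close(read_fd)
            if not child_dies_early:
                os.setsid()  # start a new session
                os.write(write_fd, struct.pack(SID_FMT, os.getsid(0)))
        finally:
            # Exit without running the parent's cleanup handlers.
            os._exit(1 if child_dies_early else 0)

    # Parent: close our copy of the write end so EOF is visible.
    os.close(write_fd)
    data = os.read(read_fd, SID_SIZE)
    os.close(read_fd)
    os.waitpid(pid, 0)

    if len(data) != SID_SIZE:
        return None, len(data)  # child died before reporting a sid
    return struct.unpack(SID_FMT, data)[0], len(data)


if __name__ == "__main__":
    print(launch_and_read_sid(child_dies_early=False))
    print(launch_and_read_sid(child_dies_early=True))
```

The point of the sketch is that the parent only sees an empty read. It
learns nothing about why the child died, which is why extra logging on the
child side matters.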

Have you tried 2.0.0p4?  We're always fixing bugs and have improved
logging in this area.  Be sure you configure with --enable-syslog.

One unfixed bug that can cause this involves the job's environment
variables.  Values containing newlines or commas can break things.
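
A quick way to check for this is to scan the submitting environment for
such values. The following is a small diagnostic sketch, not part of
TORQUE; the function name is made up for illustration, and it only looks
for the two characters mentioned above.

```python
import os

# Characters named above as troublesome in job environment values.
SUSPECT_CHARS = ("\n", ",")


def find_suspect_vars(env):
    """Return {name: [chars found]} for values containing a newline or comma.

    Hypothetical helper; `env` is any mapping of variable names to strings.
    """
    suspects = {}
    for name, value in env.items():
        found = [c for c in SUSPECT_CHARS if c in value]
        if found:
            suspects[name] = found
    return suspects


if __name__ == "__main__":
    # Report every variable in the current environment that might be a problem.
    for name, found in sorted(find_suspect_vars(os.environ).items()):
        print(f"{name}: contains {', '.join(repr(c) for c in found)}")
```

Run it in the shell the job is submitted from. It is most relevant when the
whole environment is exported to the job, for example with qsub -V.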


> core-dump:
> Program terminated with signal 11, Segmentation fault.
> #0  0x0806c358 in mom_do_poll (pjob=0x80babd0) at ./linux/mom_mach.c:1383
> 1383      assert(pjob != NULL);
> (gdb) where
> #0  0x0806c358 in mom_do_poll (pjob=0x80babd0) at ./linux/mom_mach.c:1383
> #1  0x0806c488 in mom_do_poll (pjob=0x80babd0) at ./linux/mom_mach.c:1399
> #2  0x08067c5e in start_process (ptask=0x80965a0, argv=0x0, envp=0x30303169) at start_exec.c:3085
> #3  0x08067889 in TMomFinalizeJob3 (TJE=0x80965a0, ReadSize=-1080060344, ReadErrno=-1080060344, SC=0x80c3d1e) at start_exec.c:2873
> #4  0x0806bbeb in mom_set_limits (pjob=0x80cc120, set_mode=3) at ./linux/mom_mach.c:1103
> #5  0x0806603e in TMomFinalizeChild (TJE=0x80c90e0) at start_exec.c:1685
> #6  0x0805fe9c in req_reject (code=10, aux=135041248, preq=0x8086f20, HostName=0x80bac80 "job: 200045.ingrid-h.hpc2n.umu.se numnodes=1 numvnod=1", Msg=0xb7fd40dc "<\236\022") at ./../server/reply_send.c:389
> #7  0x0805fdad in reply_free (prep=0xa) at ./../server/reply_send.c:339
> #8  0x08078325 in decode_DIS_replySvr (sock=1, reply=0x80ba300) at ./../Libifl/dec_rpys.c:181
> #9  0x080508dc in main (argc=10, argv=0x0) at mom_main.c:5053

I can't make any sense of this stack trace.  I think the crash happened
long after the memory corruption, so the trace no longer reflects the real
call chain.

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California