[torqueusers] Bug? in torque 2.0.0p2

Åke Sandgren ake.sandgren at hpc2n.umu.se
Fri Dec 16 12:11:34 MST 2005


Hi!

I have a job that consistently produces the following entries in the
mom_logs, no matter which node it ends up on. pbs_mom also dumps core
each time.

mom_log:
12/16/2005 19:44:28;0001;   pbs_mom;Job;200045.ingrid-h.hpc2n.umu.se;phase 2 of job launch successfully completed
12/16/2005 19:44:28;0001;   pbs_mom;Svr;pbs_mom;No such file or directory (2) in TMomFinalizeJob3, read of pipe for sid failed for job 200045.ingrid-h.hpc2n.umu.se (0 of 8 bytes)
12/16/2005 19:44:28;0001;   pbs_mom;Job;TMomFinalizeJob3;start failed, improper sid
12/16/2005 19:44:28;0008;   pbs_mom;Req;send_sisters;sending command ABORT_JOB for job 200045.ingrid-h.hpc2n.umu.se (10)
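
For reference, here is a minimal sketch of what a "0 of N bytes" pipe read means in general. It is not TORQUE's code: read_sid_from_child() and its arguments are hypothetical names made up for illustration, and it transfers sizeof(pid_t) bytes rather than the 8 reported in the log. A child is forked and is expected to write its session id into a pipe. If the child exits before writing, the parent's read() returns 0 (end of file) without setting errno, so the "No such file or directory (2)" in the log could be a stale errno from an earlier call rather than the real cause.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not TORQUE code): fork a child that optionally
 * reports its session id through a pipe. Returns the number of bytes the
 * parent read: sizeof(pid_t) on success, 0 if the child exited without
 * writing (EOF), or -1 on a system call error.
 */
ssize_t read_sid_from_child(int child_reports_sid, pid_t *sid_out)
{
    int fds[2];
    pid_t pid;
    ssize_t n;

    if (pipe(fds) == -1)
        return -1;

    pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: optionally become a session leader and report the sid. */
        close(fds[0]);
        if (child_reports_sid) {
            pid_t sid = setsid();
            if (write(fds[1], &sid, sizeof sid) != (ssize_t)sizeof sid)
                _exit(2);
            _exit(0);
        }
        /* Simulate a child that dies before reporting anything. */
        _exit(1);
    }

    /* Parent: close the write end so read() sees EOF if the child dies. */
    close(fds[1]);
    n = read(fds[0], sid_out, sizeof *sid_out);
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return n;
}
```

Calling read_sid_from_child(1, &sid) should return sizeof(pid_t) with a positive sid, while read_sid_from_child(0, &sid) returns 0, which is the same shape as the "0 of 8 bytes" message above. In that case the interesting question is why the child went away before writing.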




core-dump:
Program terminated with signal 11, Segmentation fault.
#0  0x0806c358 in mom_do_poll (pjob=0x80babd0)
    at ./linux/mom_mach.c:1383
1383      assert(pjob != NULL);
(gdb) where
#0  0x0806c358 in mom_do_poll (pjob=0x80babd0)
    at ./linux/mom_mach.c:1383
#1  0x0806c488 in mom_do_poll (pjob=0x80babd0)
    at ./linux/mom_mach.c:1399
#2  0x08067c5e in start_process (ptask=0x80965a0, argv=0x0, envp=0x30303169)
    at start_exec.c:3085
#3  0x08067889 in TMomFinalizeJob3 (TJE=0x80965a0, ReadSize=-1080060344,
    ReadErrno=-1080060344, SC=0x80c3d1e) at start_exec.c:2873
#4  0x0806bbeb in mom_set_limits (pjob=0x80cc120, set_mode=3)
    at ./linux/mom_mach.c:1103
#5  0x0806603e in TMomFinalizeChild (TJE=0x80c90e0) at start_exec.c:1685
#6  0x0805fe9c in req_reject (code=10, aux=135041248, preq=0x8086f20,
    HostName=0x80bac80 "job: 200045.ingrid-h.hpc2n.umu.se numnodes=1 numvnod=1",
    Msg=0xb7fd40dc "<\236\022") at ./../server/reply_send.c:389
#7  0x0805fdad in reply_free (prep=0xa) at ./../server/reply_send.c:339
#8  0x08078325 in decode_DIS_replySvr (sock=1, reply=0x80ba300)
    at ./../Libifl/dec_rpys.c:181
#9  0x080508dc in main (argc=10, argv=0x0) at mom_main.c:5053


Please help.


