[torqueusers] Stack corruption in pbs_sched 2.1.8?

Tim Miller btmiller at helix.nih.gov
Sat Jul 14 08:42:48 MDT 2007


Hi All,

I've been running pbs_sched from torque 2.1.8 (the server is still
2.1.6) for a couple of months, and I've been seeing intermittent scheduler
crashes, usually when the cluster is heavily loaded. A backtrace from
my most recent crash shows the following:

Core was generated by `pbs_sched'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/local/lib/libtorque.so.0...done.
Loaded symbols for /usr/local/lib/libtorque.so.0
Reading symbols from /lib/tls/libc.so.6...done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
#0  0x4000e668 in PBSD_mgr_put (c=1768828531, function=11, command=2,
objtype=2, objname=0x9f34e10 "27373.m1.lobos.nih.gov",
aoplp=0x9f23100,
    extend=0x0) at ../Libifl/PBSD_manage2.c:105
105       sock = connection[c].ch_socket;
(gdb) bt
#0  0x4000e668 in PBSD_mgr_put (c=1768828531, function=11, command=2,
objtype=2, objname=0x9f34e10 "27373.m1.lobos.nih.gov",
aoplp=0x9f23100,
    extend=0x0) at ../Libifl/PBSD_manage2.c:105
#1  0x4000e8b1 in PBSD_manager (c=1768828531, function=11, command=2,
objtype=2, objname=0x9f34e10 "27373.m1.lobos.nih.gov",
aoplp=0x9f23100,
    extend=0x0) at ../Libifl/PBSD_manager_caps.c:107
#2  0x4000d7dd in pbs_alterjob (c=1768828531, jobid=0x9f34e10
"27373.m1.lobos.nih.gov", attrib=0x0, extend=0x0)
    at ../Libifl/pbsD_alterjo.c:137
#3  0x0804d204 in update_job_comment (pbs_sd=1768828531, jinfo=0x9f34dd8,
    comment=0xbffc2dd0 "Not Running - PBS Error: Resource temporarily
unavailable REJHOST=o16.lobos.nih.gov MSG=cannot allocate node
'o16.lobos.nih.gov' to job - node not currently available (nps
needed/free: 4/0,  joblist: "...) at job_info.c:632
#4  0x0804bcd7 in run_update_job (pbs_sd=1768828531, sinfo=0x6f672e68,
qinfo=0x29333a76, jinfo=0x9f34dd8) at fifo.c:586
#5  0x6f626f6c in ?? ()
#6  0x696e2e73 in ?? ()
#7  0x6f672e68 in ?? ()
#8  0x29333a76 in ?? ()
#9  0x09f34d00 in ?? ()
#10 0xbffc2f10 in ?? ()
#11 0xbffc2f80 in ?? ()
#12 0x0805496f in num_res ()
#13 0x09f322f0 in ?? ()
#14 0x20746f4e in ?? ()
#15 0x6e6e7552 in ?? ()
#16 0x3a676e69 in ?? ()
#17 0x746f4e20 in ?? ()
#18 0x6f6e6520 in ?? ()
#19 0x20686775 in ?? ()
#20 0x001f7991 in sendto () from /lib/tls/libc.so.6
#21 0x400111eb in rpp_send_ack (sp=0x39333936, seq=Variable "seq" is
not available.
) at ../Libifl/rpp.c:1291
Previous frame inner to this frame (corrupt stack?)
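
One way to check whether the unresolved "??" frames are overwritten data
rather than real return addresses is to decode them as text. The following
is only a rough sketch: it assumes a 32-bit little-endian x86 host (which
the addresses suggest, but which is not stated anywhere in this report),
and the names suspect_frames and as_text are made up for illustration.

```python
import struct

# Frame "addresses" copied from the backtrace above (frames #5-#8 and #14-#19).
suspect_frames = [
    0x6f626f6c, 0x696e2e73, 0x6f672e68, 0x29333a76,
    0x20746f4e, 0x6e6e7552, 0x3a676e69, 0x746f4e20,
    0x6f6e6520, 0x20686775,
]

def as_text(word):
    """Interpret a 32-bit word as four little-endian bytes of ASCII."""
    # Assumption: 32-bit little-endian layout, hence the "<I" format.
    return struct.pack("<I", word).decode("ascii", errors="replace")

decoded = [as_text(w) for w in suspect_frames]

# The connection handle passed down the call chain (c / pbs_sd) gets the
# same treatment, since 1768828531 is far too large to be a real handle.
conn_text = as_text(1768828531)

for word, text in zip(suspect_frames, decoded):
    print(f"0x{word:08x} -> {text!r}")
print(f"c=1768828531 -> {conn_text!r}")
```

Under that assumption, frames #5 to #8 read "lobos.nih.gov:3)" and frames
#14 to #19 read "Not Running: Not enough ", while c decodes to "s.ni", the
same word as frame #6. That would be consistent with a string being written
past the end of a stack buffer, although it does not by itself identify
which buffer or which call did the writing.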

I looked around in rpp.c a bit and did not find anything overtly
wrong. Any ideas on what might cause this?

Tim

-- 
Contractor/System Administrator, Laboratory of Computational Biology NHLBI/NIH
50/3310                               301-402-0618

