[torqueusers] pbs_sched cores

mike.d.stevens at gmail.com
Wed Mar 28 15:11:57 MDT 2012


I am running a 115-node cluster using Torque 2.5.7 under CentOS 6.2. This
cluster in turn runs on a VMware ESX 4.0 cluster, the idea being that we
can use the physical resources of the Torque cluster when no jobs are
running.

I am seeing pbs_sched crash when the cluster gets busy, and the crashes
seem to be more frequent when the network is also busy. Here is the data
I've been able to assemble so far:

/var/log/messages

Mar 28 09:44:41 node25 pbs_mom: LOG_ERROR::Broken pipe (32) in rm_request, write request response failed: Protocol failure in commit#012#011message refused from port 1022 addr 10.80.101.10
Mar 28 09:44:51 node27 pbs_mom: LOG_ERROR::Broken pipe (32) in rm_request, write request response failed: Protocol failure in commit#012#011message refused from port 1022 addr 10.80.101.10
Mar 28 09:47:05 cluster1 kernel: pbs_sched[15017]: segfault at 0 ip 0000003ff4a13c44 sp 00007fff58347c90 error 4 in libtorque.so.2.0.0[3ff4a00000+2d000]
Mar 28 09:47:05 cluster1 abrt[23193]: saved core dump of pid 15017 (/usr/sbin/pbs_sched) to /var/spool/abrt/ccpp-2012-03-28-09:47:05-15017.new/coredump (3407872 bytes)
Mar 28 09:47:05 cluster1 abrtd: Directory 'ccpp-2012-03-28-09:47:05-15017' creation detected
Mar 28 09:47:05 cluster1 abrtd: Package 'torque-scheduler' isn't signed with proper key
Mar 28 09:47:05 cluster1 abrtd: Corrupted or bad dump /var/spool/abrt/ccpp-2012-03-28-09:47:05-15017 (res:2), deleting
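
For anyone decoding the kernel segfault line above: the instruction pointer
minus the library's mapping base gives the offset of the faulting
instruction inside libtorque.so.2.0.0, and the "error" value is the x86
page-fault error code. A minimal sketch in C of that arithmetic follows; the
function names are made up for illustration and are not part of Torque or
any library.

```c
#include <assert.h>
#include <stdint.h>

/* Offset of the faulting instruction inside the mapped library:
 * instruction pointer minus the mapping base from the kernel log line.
 * (Hypothetical helper name, for illustration only.) */
uint64_t fault_offset(uint64_t ip, uint64_t map_base)
{
    assert(ip >= map_base); /* the ip must lie at or above the mapping base */
    return ip - map_base;
}

/* The low three bits of the x86 page-fault error code. */
int fault_is_protection(unsigned err) { return (err & 1u) != 0; } /* 0: page not present */
int fault_is_write(unsigned err)      { return (err & 2u) != 0; } /* 0: read access */
int fault_is_user(unsigned err)       { return (err & 4u) != 0; } /* 0: kernel mode */
```

With the values from the log, fault_offset(0x3ff4a13c44, 0x3ff4a00000) is
0x13c44, and error 4 decodes as a user-mode read of a page that is not
present. Together with "segfault at 0", that is consistent with a read
through a null pointer.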

sched_log

03/28/2012 09:46:55;0040; pbs_sched;Job;267980.cluster1.cluster.affymetrix.com;Job Run
03/28/2012 09:46:55;0040; pbs_sched;Job;267981.cluster1.cluster.affymetrix.com;Job Run
03/28/2012 09:47:00;0040; pbs_sched;Job;267982.cluster1.cluster.affymetrix.com;Job Run
03/28/2012 09:47:05;0040; pbs_sched;Job;267983.cluster1.cluster.affymetrix.com;Job Run
03/28/2012 09:53:32;0002; pbs_sched;Svr;Log;Log opened
03/28/2012 09:53:32;0002; pbs_sched;Svr;TokenAct;Account file /var/lib/torque/sched_priv/accounting/20120328 opened
03/28/2012 09:53:32;0002; pbs_sched;Svr;main;/usr/sbin/pbs_sched startup pid 23588
03/28/2012 09:53:33;0040; pbs_sched;Job;267984.cluster1.cluster.affymetrix.com;Job Run
03/28/2012 09:53:34;0040; pbs_sched;Job;267985.cluster1.cluster.affymetrix.com;Job Run

gdb of core file

[root@cluster1 sched_priv]# gdb -e /usr/sbin/pbs_sched -c core.15017
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later  
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
[New Thread 15017]
Missing separate debuginfo for
Try: yum --disablerepo='*' --enablerepo='*-debuginfo' install  
/usr/lib/debug/.build-id/23/1bd9599ad974226f19adfdc4dae3691396c81d
Reading symbols from /usr/lib64/libtorque.so.2.0.0...Reading symbols from  
/usr/lib/debug/usr/lib64/libtorque.so.2.0.0.debug...done.
done.
Loaded symbols for /usr/lib64/libtorque.so.2.0.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols  
found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols  
found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /lib64/libnss_dns.so.2...(no debugging symbols  
found)...done.
Loaded symbols for /lib64/libnss_dns.so.2
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols  
found)...done.
Loaded symbols for /lib64/libresolv.so.2
Core was generated by `/usr/sbin/pbs_sched -d /var/lib/torque -a 600'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000003ff4a13c44 in pbs_rescquery (c=0, resclist=<value optimized  
out>, num_resc=<value optimized out>,
available=0x7fff58347d0c, allocated=0x7fff58347d08,  
reserved=0x7fff58347d04, down=0x7fff58347d00)
at ../Libifl/pbsD_resc.c:215
215 *(available + i) = *(reply->brp_un.brp_rescq.brq_avail + i);
Missing separate debuginfos, use: debuginfo-install  
glibc-2.12-1.47.el6_2.5.x86_64
(gdb) bt
#0 0x0000003ff4a13c44 in pbs_rescquery (c=0, resclist=<value optimized  
out>, num_resc=<value optimized out>,
available=0x7fff58347d0c, allocated=0x7fff58347d08,  
reserved=0x7fff58347d04, down=0x7fff58347d00)
at ../Libifl/pbsD_resc.c:215
#1 0x000000000040c8d6 in ?? ()
#2 0x00007fff58347d00 in ?? ()
#3 0x0000000000000000 in ?? ()
(gdb)
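
One possible reading of the backtrace above, unverified against the Torque
source: line 215 reads through reply->brp_un.brp_rescq.brq_avail, and a
fault at address 0 would fit that pointer (or the reply itself) being NULL,
perhaps after a failed exchange with the server. The sketch below shows the
kind of guard that would turn such a case into an error return instead of a
crash. The struct and function names are hypothetical stand-ins, not
Torque's actual definitions.

```c
#include <assert.h>
#include <stddef.h>

/* HYPOTHETICAL stand-in for the resource-query reply; the real Torque
 * structure is not reproduced here. */
struct rescq_reply {
    int *avail;
    int *alloc;
    int *resvd;
    int *down;
};

/* Copy n per-resource counts out of a reply into the caller's arrays.
 * Returns 0 on success, -1 if the reply, any of its arrays, or any
 * output array is missing. */
int copy_rescq(const struct rescq_reply *reply, int n,
               int *available, int *allocated, int *reserved, int *down)
{
    assert(n >= 0); /* a negative count is a caller bug */

    if (reply == NULL || reply->avail == NULL || reply->alloc == NULL ||
        reply->resvd == NULL || reply->down == NULL)
        return -1;
    if (available == NULL || allocated == NULL ||
        reserved == NULL || down == NULL)
        return -1;

    for (int i = 0; i < n; i++) {
        available[i] = reply->avail[i];
        allocated[i] = reply->alloc[i];
        reserved[i]  = reply->resvd[i];
        down[i]      = reply->down[i];
    }
    return 0;
}
```

Called with a NULL reply, or with a reply whose arrays were never filled
in, copy_rescq returns -1 and leaves the output arrays untouched, so the
caller can log the failure and retry rather than segfault.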


Does anyone have any ideas as to what is wrong here? I'd be happy to  
provide additional information.

--
Mike Stevens