[torqueusers] pbs_sched cores - repost

Coyle, James J [ITACD] jjc at iastate.edu
Mon Apr 23 10:36:53 MDT 2012


Michael,

  Two possibilities worth exploring:


1)      It seems like you may be using non-standard ports. I see pbs_mom actions on nodes 58, 40 and 2 referencing ports 707, 726 and 746; normally I'd expect TORQUE port numbers in the 15000+ range.

Is there a mismatch of port configurations between nodes?
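As a quick sanity check, something like the following can pull the (node, port) pairs out of the /var/log/messages excerpt below and flag anything outside the 15000+ range. The regex and helper name here are just an illustration I threw together, not part of TORQUE:

```python
import re

# Matches pbs_mom "cannot connect to port N" syslog lines and captures
# the node name and the port number it was trying to reach.
PORT_RE = re.compile(r"(\S+) pbs_mom: .*cannot connect to port (\d+)")

def mom_ports(log_lines):
    """Return a list of (node, port) tuples found in the log lines."""
    return [(m.group(1), int(m.group(2)))
            for line in log_lines
            if (m := PORT_RE.search(line))]

log = [
    "Apr 19 08:54:09 node58 pbs_mom: LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 707 in client_to_svr - errno:115 Operation now in progress",
    "Apr 19 08:55:46 node40 pbs_mom: LOG_ERROR::Operation now in progress (115) in post_epilogue, cannot connect to port 726 in client_to_svr - errno:115 Operation now in progress",
]
for node, port in mom_ports(log):
    flag = "OK" if port >= 15000 else "SUSPICIOUS"
    print(node, port, flag)
```

Run against the quoted logs, every port it finds falls well below 15000, which is what prompted the question above.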



2)      The following two URLs describe the resolution of an issue that seems similar to yours (a recent upgrade, with TORQUE failing some of the time):

http://www.clusterresources.com/pipermail/torqueusers/2011-March/012540.html
http://serverfault.com/questions/253932/torque-works-half-of-the-time-fails-no-permission-the-other-half



James Coyle, PhD
High Performance Computing Group
 Iowa State Univ.
web: http://jjc.public.iastate.edu/

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Stevens, Michael
Sent: Thursday, April 19, 2012 11:43 AM
To: torqueusers at supercluster.org
Subject: [torqueusers] pbs_sched cores - repost


I posted this question a few weeks ago but received no response.  Would it be more appropriate to post this to -dev?

I am running a 115-node cluster using TORQUE 2.5.7 under CentOS 6.2. This cluster in turn runs on a VMware ESX 4.0 cluster; the idea being that we can use the physical resources of the TORQUE cluster when no jobs are running.

I am seeing crashes of pbs_sched when the cluster gets busy. Following is some data I've been able to assemble thus far:

/var/log/messages

Apr 19 08:53:31 node103 dhclient[1540]: DHCPREQUEST on eth0 to 10.80.101.10 port 67 (xid=0x6c91cbd5)
Apr 19 08:53:31 cluster1 dhcpd: DHCPREQUEST for 10.80.101.123 from 00:50:56:b4:7b:a3 via eth0
Apr 19 08:53:31 cluster1 dhcpd: DHCPACK on 10.80.101.123 to 00:50:56:b4:7b:a3 via eth0
Apr 19 08:53:31 node103 dhclient[1540]: DHCPACK from 10.80.101.10 (xid=0x6c91cbd5)
Apr 19 08:53:33 node103 ypbind: NIS domain: affymetrix.com, NIS server: nis2
Apr 19 08:53:33 node103 dhclient[1540]: bound to 10.80.101.123 -- renewal in 16219 seconds.
Apr 19 08:54:09 node58 pbs_mom: LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 707 in client_to_svr - errno:115 Operation now in progress
Apr 19 08:54:15 cluster1 abrt[3555]: saved core dump of pid 2911 (/usr/sbin/pbs_sched) to /var/spool/abrt/ccpp-2012-04-19-08:54:14-2911.new/coredump (3100672 bytes)
Apr 19 08:54:15 cluster1 abrtd: Directory 'ccpp-2012-04-19-08:54:14-2911' creation detected
Apr 19 08:54:21 cluster1 abrtd: Package 'torque-scheduler' isn't signed with proper key
Apr 19 08:54:21 cluster1 abrtd: Corrupted or bad dump /var/spool/abrt/ccpp-2012-04-19-08:54:14-2911 (res:2), deleting
Apr 19 08:55:46 node40 pbs_mom: LOG_ERROR::Operation now in progress (115) in post_epilogue, cannot connect to port 726 in client_to_svr - errno:115 Operation now in progress
Apr 19 08:55:47 node2 pbs_mom: LOG_ERROR::Operation now in progress (115) in post_epilogue, cannot connect to port 746 in client_to_svr - errno:115 Operation now in progress



scheduler log

04/19/2012 08:52:55;0040; pbs_sched;Job;302539.cluster1.cluster.affymetrix.com;Job Run
04/19/2012 08:52:55;0040; pbs_sched;Job;302540.cluster1.cluster.affymetrix.com;Job Run
04/19/2012 08:52:55;0040; pbs_sched;Job;302541.cluster1.cluster.affymetrix.com;Job Run
04/19/2012 08:52:55;0040; pbs_sched;Job;302542.cluster1.cluster.affymetrix.com;Job Run
04/19/2012 08:52:55;0040; pbs_sched;Job;302543.cluster1.cluster.affymetrix.com;Job Run
04/19/2012 08:52:55;0040; pbs_sched;Job;302544.cluster1.cluster.affymetrix.com;Job Run
04/19/2012 08:52:55;0080; pbs_sched;Svr;main;brk point 38178816
04/19/2012 08:52:58;0040; pbs_sched;Job;302545.cluster1.cluster.affymetrix.com;Job Run
04/19/2012 08:53:08;0040; pbs_sched;Job;302546.cluster1.cluster.affymetrix.com;Job Run
04/19/2012 08:53:10;0040; pbs_sched;Job;302547.cluster1.cluster.affymetrix.com;Job Run
04/19/2012 08:53:14;0040; pbs_sched;Job;302548.cluster1.cluster.affymetrix.com;Job Run
04/19/2012 08:53:18;0040; pbs_sched;Job;302549.cluster1.cluster.affymetrix.com;Job Run
04/19/2012 08:53:24;0040; pbs_sched;Job;302550.cluster1.cluster.affymetrix.com;Job Run
04/19/2012 09:14:01;0002; pbs_sched;Svr;Log;Log opened
04/19/2012 09:14:01;0002; pbs_sched;Svr;TokenAct;Account file /var/lib/torque/sched_priv/accounting/20120419 opened
04/19/2012 09:14:01;0002; pbs_sched;Svr;main;/usr/sbin/pbs_sched startup pid 4707
04/19/2012 09:14:54;0040; pbs_sched;Job;302552.cluster1.cluster.affymetrix.com;Job Run


gdb of the crash file

[root at cluster1 sched_priv]# gdb -e /usr/sbin/pbs_sched -c core.2911
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
[New Thread 2911]
Missing separate debuginfo for
Try: yum --disablerepo='*' --enablerepo='*-debuginfo' install /usr/lib/debug/.build-id/23/1bd9599ad974226f19adfdc4dae3691396c81d
Reading symbols from /usr/lib64/libtorque.so.2.0.0...Reading symbols from /usr/lib/debug/usr/lib64/libtorque.so.2.0.0.debug...done.
done.
Loaded symbols for /usr/lib64/libtorque.so.2.0.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /lib64/libnss_dns.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_dns.so.2
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Core was generated by `/usr/sbin/pbs_sched -d /var/lib/torque -a 600'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000003ff4a13c44 in pbs_rescquery (c=0, resclist=<value optimized out>, num_resc=<value optimized out>,
    available=0x7fffba82910c, allocated=0x7fffba829108, reserved=0x7fffba829104, down=0x7fffba829100)
    at ../Libifl/pbsD_resc.c:215
215        *(available + i) = *(reply->brp_un.brp_rescq.brq_avail + i);
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6_2.5.x86_64
(gdb) bt
#0  0x0000003ff4a13c44 in pbs_rescquery (c=0, resclist=<value optimized out>, num_resc=<value optimized out>,
    available=0x7fffba82910c, allocated=0x7fffba829108, reserved=0x7fffba829104, down=0x7fffba829100)
    at ../Libifl/pbsD_resc.c:215
#1  0x000000000040c8d6 in ?? ()
#2  0x00007fffba829100 in ?? ()
#3  0x0000000000000000 in ?? ()
(gdb)


The last few lines of strace

read(8, "+2+1+0+0+6+0", 262144)         = 12
write(8, "+2+12+51+9Scheduler+12+182+11des"..., 53) = 53
poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
read(8, "+2+1+0+0+6+0", 262144)         = 12
write(8, "+2+12+51+9Scheduler+12+182+11des"..., 53) = 53
poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
read(8, "+2+1+0+0+6+0", 262144)         = 12
write(8, "+2+12+51+9Scheduler+12+252+11des"..., 62) = 62
poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
read(8, "+2+1+0+0+6+0", 262144)         = 12
write(8, "+2+12+51+9Scheduler+12+272+11des"..., 64) = 64
poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
read(8, "+2+1+0+0+6+0", 262144)         = 12
write(8, "+2+12+51+9Scheduler+12+192+11des"..., 54) = 54
poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
read(8, "+2+1+0+0+6+0", 262144)         = 12
write(8, "+2+12+51+9Scheduler+12+182+11des"..., 53) = 53
poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
read(8, "+2+1+0+0+6+0", 262144)         = 12
write(8, "+2+12+51+9Scheduler+12+182+11des"..., 53) = 53
poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
read(8, "+2+1+0+0+6+0", 262144)         = 12
write(8, "+2+12+51+9Scheduler+12+232+11des"..., 60) = 60
poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
read(8, "+2+1+0+0+6+0", 262144)         = 12
write(8, "+2+12+24+9Scheduler+0+12+13nodes"..., 42) = 42
poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
read(8, "+2+1+0+0+9+1+1+0+0+0", 262144) = 20
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
write(8, "+2+12+11+9Scheduler+2+22+3830255"..., 124) = 124
poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
read(8, "+2+1+0+0+1", 262144)           = 10
write(8, "+2+12+15+9Scheduler2+38302551.cl"..., 67) = 67
poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 0 (Timeout)
write(8, "+2+12+11+9Scheduler+2+22+3830255"..., 139) = 139
poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 0 (Timeout)
write(8, "+2+12+24+9Scheduler+0+12+13nodes"..., 42) = 42
poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
read(8, "+2+1+0+0+1", 262144)           = 10
--- SIGSEGV (Segmentation fault) @ 0 (0) ---

If there is any other information I can provide, please let me know, as this is reproducible.

--
Mike Stevens
Senior UNIX Administrator
Affymetrix | 3420 Central Expressway | Santa Clara, CA 95051
Tel: 408-731-5804 | Cell: 408-507-5738
