[torqueusers] pbs_sched cores - repost

David Beer dbeer at adaptivecomputing.com
Fri Apr 20 09:04:56 MDT 2012


Mike,

So pbs_sched is crashing for you? Have you considered setting up Maui?
There are very few shops that run pbs_sched and probably very few people
that have experience to help you with problems you might encounter using
it. Maui has a fairly large user base, and a lot of things are similar in
Maui and Moab, giving you two large user bases that can potentially help
you out. I personally know almost nothing about pbs_sched and can be of
almost no help.

David

On Thu, Apr 19, 2012 at 10:43 AM, Stevens, Michael <
Michael_Stevens at affymetrix.com> wrote:

> I had posted this question a few weeks ago, and received no response.
> Would it be more appropriate to post this to -dev?
>
> I am running a 115 node cluster using torque 2.5.7 under CentOS 6.2. This
> cluster is in turn running on a Vmware ESX 4.0 cluster; the idea here being
> that we can use the physical resources of the torque cluster when no jobs
> are running.
>
> I am seeing crashes of pbs_sched when the cluster gets busy. Following is
> some data I’ve been able to assemble thus far:
>
> /var/log/messages
>
> Apr 19 08:53:31 node103 dhclient[1540]: DHCPREQUEST on eth0 to 10.80.101.10 port 67 (xid=0x6c91cbd5)
> Apr 19 08:53:31 cluster1 dhcpd: DHCPREQUEST for 10.80.101.123 from 00:50:56:b4:7b:a3 via eth0
> Apr 19 08:53:31 cluster1 dhcpd: DHCPACK on 10.80.101.123 to 00:50:56:b4:7b:a3 via eth0
> Apr 19 08:53:31 node103 dhclient[1540]: DHCPACK from 10.80.101.10 (xid=0x6c91cbd5)
> Apr 19 08:53:33 node103 ypbind: NIS domain: affymetrix.com, NIS server: nis2
> Apr 19 08:53:33 node103 dhclient[1540]: bound to 10.80.101.123 -- renewal in 16219 seconds.
> Apr 19 08:54:09 node58 pbs_mom: LOG_ERROR::Operation now in progress (115) in scan_for_exiting, cannot connect to port 707 in client_to_svr - errno:115 Operation now in progress
> Apr 19 08:54:15 cluster1 abrt[3555]: saved core dump of pid 2911 (/usr/sbin/pbs_sched) to /var/spool/abrt/ccpp-2012-04-19-08:54:14-2911.new/coredump (3100672 bytes)
> Apr 19 08:54:15 cluster1 abrtd: Directory 'ccpp-2012-04-19-08:54:14-2911' creation detected
> Apr 19 08:54:21 cluster1 abrtd: Package 'torque-scheduler' isn't signed with proper key
> Apr 19 08:54:21 cluster1 abrtd: Corrupted or bad dump /var/spool/abrt/ccpp-2012-04-19-08:54:14-2911 (res:2), deleting
> Apr 19 08:55:46 node40 pbs_mom: LOG_ERROR::Operation now in progress (115) in post_epilogue, cannot connect to port 726 in client_to_svr - errno:115 Operation now in progress
> Apr 19 08:55:47 node2 pbs_mom: LOG_ERROR::Operation now in progress (115) in post_epilogue, cannot connect to port 746 in client_to_svr - errno:115 Operation now in progress
>
> scheduler log
>
> 04/19/2012 08:52:55;0040; pbs_sched;Job; 302539.cluster1.cluster.affymetrix.com;Job Run
> 04/19/2012 08:52:55;0040; pbs_sched;Job; 302540.cluster1.cluster.affymetrix.com;Job Run
> 04/19/2012 08:52:55;0040; pbs_sched;Job; 302541.cluster1.cluster.affymetrix.com;Job Run
> 04/19/2012 08:52:55;0040; pbs_sched;Job; 302542.cluster1.cluster.affymetrix.com;Job Run
> 04/19/2012 08:52:55;0040; pbs_sched;Job; 302543.cluster1.cluster.affymetrix.com;Job Run
> 04/19/2012 08:52:55;0040; pbs_sched;Job; 302544.cluster1.cluster.affymetrix.com;Job Run
> 04/19/2012 08:52:55;0080; pbs_sched;Svr;main;brk point 38178816
> 04/19/2012 08:52:58;0040; pbs_sched;Job; 302545.cluster1.cluster.affymetrix.com;Job Run
> 04/19/2012 08:53:08;0040; pbs_sched;Job; 302546.cluster1.cluster.affymetrix.com;Job Run
> 04/19/2012 08:53:10;0040; pbs_sched;Job; 302547.cluster1.cluster.affymetrix.com;Job Run
> 04/19/2012 08:53:14;0040; pbs_sched;Job; 302548.cluster1.cluster.affymetrix.com;Job Run
> 04/19/2012 08:53:18;0040; pbs_sched;Job; 302549.cluster1.cluster.affymetrix.com;Job Run
> 04/19/2012 08:53:24;0040; pbs_sched;Job; 302550.cluster1.cluster.affymetrix.com;Job Run
> 04/19/2012 09:14:01;0002; pbs_sched;Svr;Log;Log opened
> 04/19/2012 09:14:01;0002; pbs_sched;Svr;TokenAct;Account file /var/lib/torque/sched_priv/accounting/20120419 opened
> 04/19/2012 09:14:01;0002; pbs_sched;Svr;main;/usr/sbin/pbs_sched startup pid 4707
> 04/19/2012 09:14:54;0040; pbs_sched;Job; 302552.cluster1.cluster.affymetrix.com;Job Run
>
> gdb of the crash file
>
> [root@cluster1 sched_priv]# gdb -e /usr/sbin/pbs_sched -c core.2911
> GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6)
> Copyright (C) 2010 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>.
> [New Thread 2911]
> Missing separate debuginfo for
> Try: yum --disablerepo='*' --enablerepo='*-debuginfo' install /usr/lib/debug/.build-id/23/1bd9599ad974226f19adfdc4dae3691396c81d
> Reading symbols from /usr/lib64/libtorque.so.2.0.0...Reading symbols from /usr/lib/debug/usr/lib64/libtorque.so.2.0.0.debug...done.
> done.
> Loaded symbols for /usr/lib64/libtorque.so.2.0.0
> Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
> Loaded symbols for /lib64/libc.so.6
> Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
> Loaded symbols for /lib64/ld-linux-x86-64.so.2
> Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
> Loaded symbols for /lib64/libnss_files.so.2
> Reading symbols from /lib64/libnss_dns.so.2...(no debugging symbols found)...done.
> Loaded symbols for /lib64/libnss_dns.so.2
> Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
> Loaded symbols for /lib64/libresolv.so.2
> Core was generated by `/usr/sbin/pbs_sched -d /var/lib/torque -a 600'.
> Program terminated with signal 11, Segmentation fault.
> #0  0x0000003ff4a13c44 in pbs_rescquery (c=0, resclist=<value optimized out>, num_resc=<value optimized out>,
>     available=0x7fffba82910c, allocated=0x7fffba829108, reserved=0x7fffba829104, down=0x7fffba829100)
>     at ../Libifl/pbsD_resc.c:215
> 215        *(available + i) = *(reply->brp_un.brp_rescq.brq_avail + i);
> Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6_2.5.x86_64
> (gdb) bt
> #0  0x0000003ff4a13c44 in pbs_rescquery (c=0, resclist=<value optimized out>, num_resc=<value optimized out>,
>     available=0x7fffba82910c, allocated=0x7fffba829108, reserved=0x7fffba829104, down=0x7fffba829100)
>     at ../Libifl/pbsD_resc.c:215
> #1  0x000000000040c8d6 in ?? ()
> #2  0x00007fffba829100 in ?? ()
> #3  0x0000000000000000 in ?? ()
> (gdb)
>
> The last few lines of strace
>
> read(8, "+2+1+0+0+6+0", 262144)         = 12
> write(8, "+2+12+51+9Scheduler+12+182+11des"..., 53) = 53
> poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
> read(8, "+2+1+0+0+6+0", 262144)         = 12
> write(8, "+2+12+51+9Scheduler+12+182+11des"..., 53) = 53
> poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
> read(8, "+2+1+0+0+6+0", 262144)         = 12
> write(8, "+2+12+51+9Scheduler+12+252+11des"..., 62) = 62
> poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
> read(8, "+2+1+0+0+6+0", 262144)         = 12
> write(8, "+2+12+51+9Scheduler+12+272+11des"..., 64) = 64
> poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
> read(8, "+2+1+0+0+6+0", 262144)         = 12
> write(8, "+2+12+51+9Scheduler+12+192+11des"..., 54) = 54
> poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
> read(8, "+2+1+0+0+6+0", 262144)         = 12
> write(8, "+2+12+51+9Scheduler+12+182+11des"..., 53) = 53
> poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
> read(8, "+2+1+0+0+6+0", 262144)         = 12
> write(8, "+2+12+51+9Scheduler+12+182+11des"..., 53) = 53
> poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
> read(8, "+2+1+0+0+6+0", 262144)         = 12
> write(8, "+2+12+51+9Scheduler+12+232+11des"..., 60) = 60
> poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
> read(8, "+2+1+0+0+6+0", 262144)         = 12
> write(8, "+2+12+24+9Scheduler+0+12+13nodes"..., 42) = 42
> poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
> read(8, "+2+1+0+0+9+1+1+0+0+0", 262144) = 20
> stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
> write(8, "+2+12+11+9Scheduler+2+22+3830255"..., 124) = 124
> poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
> read(8, "+2+1+0+0+1", 262144)           = 10
> write(8, "+2+12+15+9Scheduler2+38302551.cl"..., 67) = 67
> poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 0 (Timeout)
> write(8, "+2+12+11+9Scheduler+2+22+3830255"..., 139) = 139
> poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 0 (Timeout)
> write(8, "+2+12+24+9Scheduler+0+12+13nodes"..., 42) = 42
> poll([{fd=8, events=POLLIN|POLLHUP}], 1, 20000) = 1 ([{fd=8, revents=POLLIN}])
> read(8, "+2+1+0+0+1", 262144)           = 10
> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>
> If there is any other information I can provide, please let me know as
> this is reproducible.
>
> --
> Mike Stevens
> Senior UNIX Administrator
> Affymetrix | 3420 Central Expressway | Santa Clara, CA 95051
> Tel: 408-731-5804 | Cell: 408-507-5738
>
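
For anyone reading this thread later, here is a sketch of one possible
reading of the trace quoted above. It is an interpretation, not something
confirmed in the thread. Frame #0 faults while copying from
reply->brp_un.brp_rescq.brq_avail. In the strace, the last "nodes" request
is sent after two poll timeouts and is answered with the 10-byte
"+2+1+0+0+1", whereas the earlier "nodes" request received the 20-byte
"+2+1+0+0+9+1+1+0+0+0". If that short reply is a late answer to one of the
timed-out requests, its resource-query fields would not be populated. The C
below is a self-contained model of guarding against that case. struct
rescq_reply, enum reply_kind and its values, and copy_available() are
hypothetical names made up for illustration; they are not the libtorque
definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reply kinds; names and values are illustrative only and
 * are not copied from the Torque headers. */
enum reply_kind { REPLY_NULL = 1, REPLY_RESCQUERY = 9 };

/* Simplified stand-in for the reply structure seen in the backtrace. */
struct rescq_reply {
    enum reply_kind kind;
    int  num;    /* number of entries in avail */
    int *avail;  /* only meaningful when kind == REPLY_RESCQUERY */
};

/* Copy availability counts out of a reply, refusing anything that is not
 * a well-formed resource-query reply of the expected size.
 * Returns 0 on success, -1 if the reply must not be used. */
int copy_available(const struct rescq_reply *reply, int *available, int num_resc)
{
    assert(available != NULL && num_resc >= 0);

    if (reply == NULL || reply->kind != REPLY_RESCQUERY)
        return -1;  /* missing reply, or a reply to some other request */
    if (reply->avail == NULL || reply->num != num_resc)
        return -1;  /* right kind, but not the shape the caller asked for */

    for (int i = 0; i < num_resc; i++)
        available[i] = reply->avail[i];
    return 0;
}
```

In a design like this, a caller would treat -1 as "discard the reply and
reconnect" instead of continuing the scheduling cycle with data it cannot
trust.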


-- 
David Beer | Software Engineer
Adaptive Computing