[torqueusers] torque 2.1.0p0
Sam Rash
srash at yahoo-inc.com
Thu Jun 15 22:05:18 MDT 2006
Not to beat a dead horse, but after compiling with debug info, I have more
detail.
It looks like pbs_sched isn't correctly handling a reply from the server to
a resource query:
sudo gdb /home/y/sbin/pbs_sched
GNU gdb 4.18 (FreeBSD)
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-unknown-freebsd"...Deprecated bfd_read
called at
/home/src/gnu/usr.bin/binutils/gdb/../../../../contrib/gdb/gdb/dbxread.c
line 2627 in elfstab_build_psymtabs
Deprecated bfd_read called at
/home/src/gnu/usr.bin/binutils/gdb/../../../../contrib/gdb/gdb/dbxread.c
line 933 in fill_symbuf
(gdb) core ./pbs_sched.core
Core was generated by `pbs_sched'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/lib/libkvm.so.2...done.
Reading symbols from /usr/lib/libc.so.4...done.
Reading symbols from /usr/libexec/ld-elf.so.1...done.
#0 0x100a287 in pbs_rescquery (c=0, resclist=0x9fbff9ac, num_resc=1, available=0x9fbff9b0, allocated=0x9fbff9b4, reserved=0x9fbff9b8, down=0x9fbff9bc) at ../Libifl/pbsD_resc.c:218
218 *(available + i) = *(reply->brp_un.brp_rescq.brq_avail + i);
(gdb) bt
#0 0x100a287 in pbs_rescquery (c=0, resclist=0x9fbff9ac, num_resc=1, available=0x9fbff9b0, allocated=0x9fbff9b4, reserved=0x9fbff9b8, down=0x9fbff9bc) at ../Libifl/pbsD_resc.c:218
#1 0x1008ef7 in check_nodes (pbs_sd=0, jinfo=0x1127080, ninfo_arr=0x0) at check.c:493
#2 0x1008b99 in is_ok_to_run_job (pbs_sd=0, sinfo=0x102c080, qinfo=0x102c600, jinfo=0x1127080) at check.c:183
#3 0x100354c in scheduling_cycle (sd=0) at fifo.c:412
#4 0x10033e7 in schedule (cmd=2, sd=0) at fifo.c:346
#5 0x1002ea9 in main (argc=1, argv=0x9fbffde8) at pbs_sched.c:1036
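
The faulting statement at pbsD_resc.c:218 dereferences
reply->brp_un.brp_rescq.brq_avail without any visible check. The strace in the
original message also shows the last reply read before the crash
("+2+1+0+0+1") is shorter than the earlier "+2+1+0+0+9+1+1+0+0+0", so one
possible explanation is that the reply carried no resource-query payload. The
following is only a minimal sketch of the kind of guard that would turn such a
case into an error return instead of a seg fault. It is not the actual TORQUE
code: the struct layouts, the has_rescq flag, the count field, and the function
name copy_available_sketch are all hypothetical; only the names brp_un,
brp_rescq, and brq_avail are taken from the gdb output.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the resource-query part of a reply.
 * Only the field name brq_avail comes from the gdb output above. */
struct rescq_sketch {
    int  count;       /* assumed: number of entries in brq_avail */
    int *brq_avail;   /* may be NULL if the reply carried no payload */
};

/* Hypothetical stand-in for the batch reply. The real structure is not
 * reproduced here; brp_un is modelled as a plain struct for brevity. */
struct reply_sketch {
    int has_rescq;    /* assumed flag: nonzero if resource-query data is present */
    struct {
        struct rescq_sketch brp_rescq;
    } brp_un;
};

/* Copy num_resc "available" values out of the reply.
 * Returns 0 on success, -1 if the reply cannot be used safely. */
int copy_available_sketch(const struct reply_sketch *reply,
                          int *available, int num_resc)
{
    int i;

    if (reply == NULL || available == NULL || num_resc < 0)
        return -1;
    if (!reply->has_rescq)                          /* wrong kind of reply */
        return -1;
    if (reply->brp_un.brp_rescq.brq_avail == NULL)  /* no payload to copy */
        return -1;
    if (reply->brp_un.brp_rescq.count < num_resc)   /* fewer entries than asked */
        return -1;

    for (i = 0; i < num_resc; i++)
        *(available + i) = *(reply->brp_un.brp_rescq.brq_avail + i);

    return 0;
}
```

With a guard of this shape, a caller such as check_nodes would see an error
return for an unexpected reply rather than the process dying; whether that
matches the real cause here would still need to be confirmed against the
actual source.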
Sam Rash
srash at yahoo-inc.com
408-349-7312
vertigosr37
_____
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Sam Rash
Sent: Thursday, June 15, 2006 8:23 PM
To: torqueusers at supercluster.org
Subject: RE: [torqueusers] torque 2.1.0p0
Yet more info:
Using the older torque scheduler (confirmed 2.0.0p8 with latest patches)
still results in pbs_sched crashing with a seg fault. I should add that this
crash occurs when MANY (512 up to 2000+) jobs are in the system and many finish
within seconds.
1) again, has anyone seen this behavior in either 2.0.0 or 2.1.0?
2) any suggested fixes/work-arounds?
3) I assume this would go away if we moved to maui (bumps up the
priority of making this change quite a bit)
Again, I appreciate any help on this matter.
Regards,
Sam Rash
srash at yahoo-inc.com
408-349-7312
vertigosr37
_____
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Sam Rash
Sent: Thursday, June 15, 2006 7:54 PM
To: torqueusers at supercluster.org
Subject: FW: [torqueusers] torque 2.1.0p0
As a side note, I downgraded just the scheduler daemon (pbs_sched) back to
the older one and things *seem* to work fine. I do notice that certain
commands such as pbsnodes -a or qstat -Q hang for several seconds. However,
the submission time is MUCH faster (512 jobs submit in 1-2 min vs 10-15 min
before).
All of this should go away as soon as we migrate to maui (vs the simple
pbs_sched fifo module)
If anyone can help with the seg fault -or- confirm that using the previous
release's pbs_sched will not impact performance or stability negatively,
that would be great.
Sam Rash
srash at yahoo-inc.com
408-349-7312
vertigosr37
_____
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Sam Rash
Sent: Thursday, June 15, 2006 11:30 AM
To: torqueusers at supercluster.org
Subject: [torqueusers] torque 2.1.0p0
Hi,
I recently built torque 2.1.0p0 on FreeBSD without any problems. However,
in running the new pbs_sched daemon, I see a seg fault periodically. Here
is the tail of an strace:
fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
read(7, "+2+1+0+0+9+1+1+0+0+0", 65536) = 20
write(7, "+2+12+11+9Scheduler+2+22+384513."..., 124) = 124
poll([{fd=7, events=POLLIN|POLLHUP, revents=POLLIN}], 1, 20000) = 1
fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
read(7, "+2+1+0+0+1", 65536) = 10
write(7, "+2+12+15+9Scheduler2+384513.medi"..., 67) = 67
poll([{fd=7, events=POLLIN|POLLHUP, revents=POLLIN}], 1, 20000) = 1
fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
read(7, "+2+1+0+0+1", 65536) = 10
gettimeofday({1150395690, 529624}, NULL) = 0
write(3, "06/15/2006 11:21:30;0040; pbs_sc"..., 87) = 87
write(7, "+2+12+24+9Scheduler+0+1+7nodes=1"..., 34) = 34
poll([{fd=7, events=POLLIN|POLLHUP, revents=POLLIN}], 1, 20000) = 1
fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
read(7, "+2+1+0+0+1", 65536) = 10
--- SIGSEGV (Segmentation fault) ---
--- SIGSEGV (Segmentation fault) ---
--- 139 (Unknown signal: 139) ---
PIOCRUN: Inappropriate ioctl for device
I can generate more output if needed, but does this problem look at all
familiar to anyone (either on FreeBSD or any other system)?
The previous torque build, 2.0.8p16 (I think that's the right #) worked
fine.
Thanks in advance,
Sam Rash
srash at yahoo-inc.com
408-349-7312
vertigosr37