[torqueusers] torque 2.1.0p0

Sam Rash srash at yahoo-inc.com
Thu Jun 15 22:05:18 MDT 2006


Not to beat a dead horse, but after compiling with debug info, I have the
following. It looks like pbs_sched isn't handling a reply from the server
regarding a resource query.

 

 

sudo gdb /home/y/sbin/pbs_sched
GNU gdb 4.18 (FreeBSD)
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-unknown-freebsd"...Deprecated bfd_read called at
/home/src/gnu/usr.bin/binutils/gdb/../../../../contrib/gdb/gdb/dbxread.c
line 2627 in elfstab_build_psymtabs
Deprecated bfd_read called at
/home/src/gnu/usr.bin/binutils/gdb/../../../../contrib/gdb/gdb/dbxread.c
line 933 in fill_symbuf

(gdb) core ./pbs_sched.core
Core was generated by `pbs_sched'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/lib/libkvm.so.2...done.
Reading symbols from /usr/lib/libc.so.4...done.
Reading symbols from /usr/libexec/ld-elf.so.1...done.
#0  0x100a287 in pbs_rescquery (c=0, resclist=0x9fbff9ac, num_resc=1, available=0x9fbff9b0,
    allocated=0x9fbff9b4, reserved=0x9fbff9b8, down=0x9fbff9bc) at ../Libifl/pbsD_resc.c:218
218           *(available + i) = *(reply->brp_un.brp_rescq.brq_avail + i);
(gdb) bt
#0  0x100a287 in pbs_rescquery (c=0, resclist=0x9fbff9ac, num_resc=1, available=0x9fbff9b0,
    allocated=0x9fbff9b4, reserved=0x9fbff9b8, down=0x9fbff9bc) at ../Libifl/pbsD_resc.c:218
#1  0x1008ef7 in check_nodes (pbs_sd=0, jinfo=0x1127080, ninfo_arr=0x0) at check.c:493
#2  0x1008b99 in is_ok_to_run_job (pbs_sd=0, sinfo=0x102c080, qinfo=0x102c600, jinfo=0x1127080)
    at check.c:183
#3  0x100354c in scheduling_cycle (sd=0) at fifo.c:412
#4  0x10033e7 in schedule (cmd=2, sd=0) at fifo.c:346
#5  0x1002ea9 in main (argc=1, argv=0x9fbffde8) at pbs_sched.c:1036

 

Sam Rash

srash at yahoo-inc.com

408-349-7312

vertigosr37

  _____  

From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Sam Rash
Sent: Thursday, June 15, 2006 8:23 PM
To: torqueusers at supercluster.org
Subject: RE: [torqueusers] torque 2.1.0p0

 

Yet more info:

 

Using the older torque scheduler (confirmed 2.0.0p8 with the latest patches)
still results in pbs_sched crashing with a seg fault. I should add that this
crash occurs when MANY (512 up to 2000+) jobs are in the system and many
finish within seconds.

 

1) Again, has anyone seen this behavior in either 2.0.0 or 2.1.0?

2) Any suggested fixes/workarounds?

3) I assume this would go away if we moved to Maui (which bumps up the
priority of making that change quite a bit).

 

Again, I appreciate any help on this matter.

 

Regards, 

Sam Rash

srash at yahoo-inc.com

408-349-7312

vertigosr37

  _____  

From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Sam Rash
Sent: Thursday, June 15, 2006 7:54 PM
To: torqueusers at supercluster.org
Subject: FW: [torqueusers] torque 2.1.0p0

 

As a side note, I downgraded just the scheduler daemon (pbs_sched) back to
the older one and things *seem* to work fine.  I do notice that certain
commands such as pbsnodes -a or qstat -Q hang for several seconds.  However,
submission time is MUCH faster (512 jobs submit in 1-2 min vs 10-15 min
before).

 

All of this should go away as soon as we migrate to Maui (vs the simple
pbs_sched FIFO module).

 

If anyone can help with the seg fault -or- confirm that using the previous
release's pbs_sched will not impact performance or stability negatively,
that would be great.

 

 

Sam Rash

srash at yahoo-inc.com

408-349-7312

vertigosr37

  _____  

From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Sam Rash
Sent: Thursday, June 15, 2006 11:30 AM
To: torqueusers at supercluster.org
Subject: [torqueusers] torque 2.1.0p0

 

Hi,


I recently built torque 2.1.0p0 on FreeBSD without any problems.  However,
in running the new pbs_sched daemon, I see a seg fault periodically.  Here
is the tail of an strace:

 

fcntl(7, F_GETFL)                       = 0x2 (flags O_RDWR)
read(7, "+2+1+0+0+9+1+1+0+0+0", 65536)  = 20
write(7, "+2+12+11+9Scheduler+2+22+384513."..., 124) = 124
poll([{fd=7, events=POLLIN|POLLHUP, revents=POLLIN}], 1, 20000) = 1
fcntl(7, F_GETFL)                       = 0x2 (flags O_RDWR)
read(7, "+2+1+0+0+1", 65536)            = 10
write(7, "+2+12+15+9Scheduler2+384513.medi"..., 67) = 67
poll([{fd=7, events=POLLIN|POLLHUP, revents=POLLIN}], 1, 20000) = 1
fcntl(7, F_GETFL)                       = 0x2 (flags O_RDWR)
read(7, "+2+1+0+0+1", 65536)            = 10
gettimeofday({1150395690, 529624}, NULL) = 0
write(3, "06/15/2006 11:21:30;0040; pbs_sc"..., 87) = 87
write(7, "+2+12+24+9Scheduler+0+1+7nodes=1"..., 34) = 34
poll([{fd=7, events=POLLIN|POLLHUP, revents=POLLIN}], 1, 20000) = 1
fcntl(7, F_GETFL)                       = 0x2 (flags O_RDWR)
read(7, "+2+1+0+0+1", 65536)            = 10
--- SIGSEGV (Segmentation fault) ---
--- SIGSEGV (Segmentation fault) ---
--- 139 (Unknown signal: 139) ---
PIOCRUN: Inappropriate ioctl for device

 

 

I can generate more on the matter, but does this problem look at all
familiar to anyone (either on FreeBSD or any other system)?

The previous torque build, 2.0.8p16 (I think that's the right #), worked
fine.

 

Thanks in advance,

 

Sam Rash

srash at yahoo-inc.com

408-349-7312

vertigosr37

 
