[torqueusers] torque 2.1.0p0

Sam Rash srash at yahoo-inc.com
Thu Jun 15 21:23:20 MDT 2006


Yet more info:

Using the older torque scheduler (confirmed: 2.0.0p8 with the latest patches)
still results in pbs_sched crashing with a seg fault. I should add that this
crash occurs when MANY jobs (512 up to 2000+) are in the system and many of
them finish within seconds.

 

1) Again, has anyone seen this behavior in either 2.0.0 or 2.1.0?

2) Any suggested fixes or work-arounds?

3) I assume this would go away if we moved to Maui (which bumps up the
   priority of making that change quite a bit).

 

Again, I appreciate any help on this matter.

 

Regards, 

Sam Rash

srash at yahoo-inc.com

408-349-7312

vertigosr37

  _____  

From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Sam Rash
Sent: Thursday, June 15, 2006 7:54 PM
To: torqueusers at supercluster.org
Subject: FW: [torqueusers] torque 2.1.0p0

 

As a side note, I downgraded just the scheduler daemon (pbs_sched) back to
the older one and things *seem* to work fine.  I do notice that certain
commands such as pbsnodes -a or qstat -Q hang for several seconds.  However,
the submission time is MUCH faster (512 jobs submit in 1-2 min vs 10-15 min
before).

 

All of this should go away as soon as we migrate to Maui (vs the simple
pbs_sched FIFO module).

 

If anyone can help with the seg fault -or- confirm that using the previous
release's pbs_sched will not impact performance or stability negatively,
that would be great.

 

 

Sam Rash


  _____  

From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Sam Rash
Sent: Thursday, June 15, 2006 11:30 AM
To: torqueusers at supercluster.org
Subject: [torqueusers] torque 2.1.0p0

 

Hi,


I recently built torque 2.1.0p0 on FreeBSD without any problems.  However,
when running the new pbs_sched daemon, I periodically see a seg fault.  Here
is the tail of an strace:

 

fcntl(7, F_GETFL)                       = 0x2 (flags O_RDWR)
read(7, "+2+1+0+0+9+1+1+0+0+0", 65536)  = 20
write(7, "+2+12+11+9Scheduler+2+22+384513."..., 124) = 124
poll([{fd=7, events=POLLIN|POLLHUP, revents=POLLIN}], 1, 20000) = 1
fcntl(7, F_GETFL)                       = 0x2 (flags O_RDWR)
read(7, "+2+1+0+0+1", 65536)            = 10
write(7, "+2+12+15+9Scheduler2+384513.medi"..., 67) = 67
poll([{fd=7, events=POLLIN|POLLHUP, revents=POLLIN}], 1, 20000) = 1
fcntl(7, F_GETFL)                       = 0x2 (flags O_RDWR)
read(7, "+2+1+0+0+1", 65536)            = 10
gettimeofday({1150395690, 529624}, NULL) = 0
write(3, "06/15/2006 11:21:30;0040; pbs_sc"..., 87) = 87
write(7, "+2+12+24+9Scheduler+0+1+7nodes=1"..., 34) = 34
poll([{fd=7, events=POLLIN|POLLHUP, revents=POLLIN}], 1, 20000) = 1
fcntl(7, F_GETFL)                       = 0x2 (flags O_RDWR)
read(7, "+2+1+0+0+1", 65536)            = 10
--- SIGSEGV (Segmentation fault) ---
--- SIGSEGV (Segmentation fault) ---
--- 139 (Unknown signal: 139) ---
PIOCRUN: Inappropriate ioctl for device
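
For what it's worth, the "139" in the last trace lines is most likely not a
separate signal but the usual shell-style encoding of "killed by a signal",
i.e. 128 + 11, where 11 is SIGSEGV on Linux and FreeBSD. A small Python
sketch of that arithmetic follows; the helper name describe_exit_status is
made up for illustration and is not part of torque or strace, and the
128 + N convention is an assumption about where the number comes from:

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status.

    Assumption: values above 128 mean "terminated by signal (status - 128)",
    which is the convention used by common Unix shells.
    """
    if status > 128:
        signum = status - 128
        try:
            # signal.Signals maps a signal number to its symbolic name.
            return signal.Signals(signum).name
        except ValueError:
            # The number does not correspond to a signal on this platform.
            return f"unknown signal {signum}"
    return f"normal exit with code {status}"


if __name__ == "__main__":
    # 139 - 128 = 11, which is SIGSEGV on Linux and FreeBSD.
    print(describe_exit_status(139))
```

If that reading is right, the trace shows a single kind of failure (a
segmentation fault in pbs_sched) rather than a seg fault followed by some
other unknown signal.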

 

 

I can generate more information on the matter, but does this problem look at
all familiar to anyone (either on FreeBSD or any other system)?

The previous torque build, 2.0.8p16 (I think that's the right #), worked
fine.

 

Thanks in advance,

 

Sam Rash


 
