[torqueusers] torque 2.1.0p0
Sam Rash
srash at yahoo-inc.com
Thu Jun 15 21:23:20 MDT 2006
Yet more info:
Using the older torque scheduler (confirmed 2.0.0p8 with latest patches)
still results in pbs_sched crashing with a seg fault. I should add that this
crash occurs when MANY (512 up to 2000+) jobs are in the system and many
finish within seconds.
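For anyone who wants to try reproducing the load described above, here is a minimal sketch that submits a burst of very short jobs. It assumes qsub is on the PATH and reads the job script from stdin; the function names submit_with_qsub and submit_burst, and the one-second sleep, are illustrative choices and not part of the original report.

```python
import subprocess
from typing import Callable, List


def submit_with_qsub(script: str) -> str:
    """Pipe a job script to qsub on stdin and return the job id it prints."""
    result = subprocess.run(
        ["qsub"], input=script, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def submit_burst(count: int, seconds: int = 1,
                 submit: Callable[[str], str] = submit_with_qsub) -> List[str]:
    """Submit `count` jobs that each sleep briefly and then exit.

    `submit` is injectable so the loop can be exercised without a
    running pbs_server (hypothetical helper, for illustration only).
    """
    script = "#!/bin/sh\nsleep {}\n".format(seconds)
    return [submit(script) for _ in range(count)]
```

Calling submit_burst(512) would queue 512 one-second jobs, which approximates the "many jobs finishing within seconds" condition; whether that alone triggers the crash is untested.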
1) Again, has anyone seen this behavior in either 2.0.0 or 2.1.0?
2) Any suggested fixes/work-arounds?
3) I assume this would go away if we moved to maui (which bumps up the
priority of making that change quite a bit).
Again, I appreciate any help on this matter.
Regards,
Sam Rash
srash at yahoo-inc.com
408-349-7312
vertigosr37
_____
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Sam Rash
Sent: Thursday, June 15, 2006 7:54 PM
To: torqueusers at supercluster.org
Subject: FW: [torqueusers] torque 2.1.0p0
As a side note, I downgraded just the scheduler daemon (pbs_sched) back to
the older one and things *seem* to work fine. I do notice that certain
commands such as pbsnodes -a or qstat -Q hang for several seconds. However,
the submission time is MUCH faster (512 jobs submit in 1-2 min vs 10-15 min
before).
All of this should go away as soon as we migrate to maui (vs the simple
pbs_sched fifo module).
If anyone can help with the seg fault -or- confirm that using the previous
release's pbs_sched will not impact performance or stability negatively,
that would be great.
Sam Rash
srash at yahoo-inc.com
_____
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Sam Rash
Sent: Thursday, June 15, 2006 11:30 AM
To: torqueusers at supercluster.org
Subject: [torqueusers] torque 2.1.0p0
Hi,
I recently built torque 2.1.0p0 on FreeBSD without any problems. However,
when running the new pbs_sched daemon, I see a seg fault periodically. Here
is the tail of an strace:
fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
read(7, "+2+1+0+0+9+1+1+0+0+0", 65536) = 20
write(7, "+2+12+11+9Scheduler+2+22+384513."..., 124) = 124
poll([{fd=7, events=POLLIN|POLLHUP, revents=POLLIN}], 1, 20000) = 1
fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
read(7, "+2+1+0+0+1", 65536) = 10
write(7, "+2+12+15+9Scheduler2+384513.medi"..., 67) = 67
poll([{fd=7, events=POLLIN|POLLHUP, revents=POLLIN}], 1, 20000) = 1
fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
read(7, "+2+1+0+0+1", 65536) = 10
gettimeofday({1150395690, 529624}, NULL) = 0
write(3, "06/15/2006 11:21:30;0040; pbs_sc"..., 87) = 87
write(7, "+2+12+24+9Scheduler+0+1+7nodes=1"..., 34) = 34
poll([{fd=7, events=POLLIN|POLLHUP, revents=POLLIN}], 1, 20000) = 1
fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
read(7, "+2+1+0+0+1", 65536) = 10
--- SIGSEGV (Segmentation fault) ---
--- SIGSEGV (Segmentation fault) ---
--- 139 (Unknown signal: 139) ---
PIOCRUN: Inappropriate ioctl for device
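A note on the "139" in the trace above: it is most likely not a separate signal but the usual shell convention of reporting 128 plus the signal number, i.e. 128 + 11 (SIGSEGV). The following is a small sketch of that arithmetic; describe_exit_status is a made-up helper name, and the 128 offset is an assumption about how the status was reported here.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status (hypothetical helper).

    Assumes the common convention that a status above 128 means the
    process was killed by signal (status - 128).
    """
    if status > 128:
        signum = status - 128
        try:
            return signal.Signals(signum).name
        except ValueError:
            return "unknown signal {}".format(signum)
    return "exit code {}".format(status)


print(describe_exit_status(139))
```

Under that assumption, 139 is consistent with the two SIGSEGV lines immediately before it and does not indicate a second, different failure.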
I can generate more output on the matter, but does this problem look at all
familiar to anyone (either on FreeBSD or any other system)?
The previous torque build, 2.0.8p16 (I think that's the right #) worked
fine.
Thanks in advance,
Sam Rash
srash at yahoo-inc.com