[torquedev] pbs_mom segfault in TMomCheckJobChild
josh at clusterresources.com
Mon Dec 22 13:10:05 MST 2008
I figure I might as well put in my two bits. :)
First of all, I believe Josh Bernstein when he says that he sees a crash
and then, after making changes to the code, the crash goes away. I can't
think of any reason he'd lie to us! :)
With that said, however, I must agree that it is unclear why the changes
Josh is making would prevent a segfault. Like Glen said, there must be
something else going on, but fixing the code like this changes behavior
slightly or does something to help reduce/eliminate a crash. Maybe it
offsets a race condition or something? That seems doubtful. I'm just not
sure (haven't had the proper time to look into it). That's why I think
we should continue trying to pin down exactly what is going on and fix
it in a way that we can better confirm we are fixing an observable
behavior leading to the crash. I contacted Josh out of band and got some
of the core files from him to try and get more info about the crash.
> I agree Michael. Up to this point, I've only experienced a few minor
> bugs in TORQUE, but certainly nothing this drastic. I believe that this
> particular issue revolves around the workload, many thousands of short,
> small jobs. I imagine on a much larger cluster, such as yours, the jobs
> run for much longer.
> That all said, I'm surprised that others haven't reported similar
> issues. Though, just because nobody has reported it, doesn't mean it's
> not happening. It may just mean they've switched to something like SGE
> or PBSPro. At least that's what I've heard from others. ;-)
As to whether others have seen a problem like this, I can say that a
similar issue *has* been experienced by customers who, like Josh, try to
run thousands of very short jobs (1-5 seconds) as fast as possible. Most
of the problems we saw, however, were actually in the pbs_server and not
the pbs_mom. Basically, the pbs_server would run out of sockets and
errors would show up in the log file about not being able to open up
other kinds of files. Also, sometimes the socket descriptors themselves
would not leak, but the socket handlers inside of TORQUE would leak.
These problems, so far, have been addressed by two "workarounds":
1) Making TORQUE so that it can support more than 1024 descriptors at a
time.
2) Increasing the size of the socket handler table.
Both of these fixes are only currently present in the 2.3-extreme and
The problem seemed to occur in our labs in pbs_server when you have a
LOT of jobs (thousands) all running at the same time (or close to the
same time). I bet if you had a lot of jobs running on one pbs_mom you
could have a similar problem happening in the other direction.
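
For anyone wondering where a 1024-descriptor ceiling typically comes from,
here is a rough sketch. It is not TORQUE code; both helper names are made
up for illustration, and I am assuming the usual Linux situation where the
soft RLIMIT_NOFILE limit defaults to 1024 and select() cannot watch
descriptors at or above FD_SETSIZE. Whether the actual workaround in TORQUE
takes this shape is not something this sketch claims.

```c
#include <sys/resource.h>
#include <sys/select.h>

/* Hypothetical helper (not part of TORQUE): raise the soft descriptor
 * limit to the hard limit. Returns 0 on success, -1 on failure. */
int raise_fd_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;

    if (rl.rlim_cur < rl.rlim_max) {
        /* An unprivileged process may raise its soft limit up to the
         * hard limit, but not beyond it. */
        rl.rlim_cur = rl.rlim_max;
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
            return -1;
    }
    return 0;
}

/* Hypothetical helper: an fd_set only has room for descriptors in the
 * range 0 .. FD_SETSIZE-1, so anything outside that range cannot be
 * passed to FD_SET()/select() safely. Returns 1 if usable, 0 if not. */
int fd_usable_with_select(int fd)
{
    return fd >= 0 && fd < FD_SETSIZE;
}
```

A daemon would call raise_fd_limit() once at startup and check
fd_usable_with_select() on each new socket before adding it to an fd_set.
Raising the rlimit alone does not help a select()-based loop, since
FD_SETSIZE is fixed at compile time; going past it means a different
mechanism such as poll().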
Anyway, if Josh has a reproducible case, which it sounds like he does,
this is a perfect time to fix the problem. We could never easily
reproduce the issue.
One question remains for me: do we wait to release 2.3.6 for
this fix, considering that it may take until after the holidays to fully
understand and fix? Or do we try and roll out 2.3.6 before Christmas
hits? Most sane admins wouldn't upgrade before the holidays anyway ...
but who knows. :)