[torquedev] pbs_server segfault in req_delete.c
jbernstein at penguincomputing.com
Mon Dec 29 12:02:33 MST 2008
Garrick Staples wrote:
> On Tue, Dec 23, 2008 at 03:47:38PM -0800, Joshua Bernstein alleged:
>> Hello TORQUE Fans!
>> Remember me? I figured I'd drop one more observed and repeatable
>> segfault before we all went on a break for the holidays. This time
>> though it seems to be inside of pbs_server. I'm running on X86_64, and
>> I've been able to reproduce this problem in both version 2.3.3 and the
>> brand shiny new 2.3.6.
>> Essentially, if you issue a qdel -p to clear the queue of stale
>> jobs, pbs_server appears to continue to operate normally, but shortly after
>> new jobs get submitted to the queue, pbs_server posts this message and dies.
>> Assertion failed, bad pointer in link: file "req_delete.c", line 844
>> Aborted (core dumped)
> While segfaults need to always be fixed, you are using qdel -p incorrectly. It
> should only be used if a running job will not exit because its allocated nodes
> are unreachable.
Well if I'm using it incorrectly, then I think qdel should error out
with a message like:
"Ignoring request: pbs_mom is reachable"
The system shouldn't let me do something that isn't allowed, or isn't
> qdel -p is a very bad thing to do. It is intentionally breaking pbs_server's
> idea of what is going on.
> Since you are using qdel -p when you have a running pbs_mom that has the job,
> you are bound to have bad things happen.
I recognize that. But customers, and other less experienced people who
manage TORQUE systems, don't.
Still, the segv should be fixed, or be prevented by making qdel -p aware
that it's the wrong thing to do when pbs_mom is available. Perhaps,
rather than only printing a warning, it could ignore the -p option,
print the warning, and issue a plain qdel.
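
To make the proposal concrete, here is a minimal sketch of that fallback policy. It is written in Python rather than TORQUE's C, and it is not TORQUE code: DeleteAction, choose_delete_action, and the mom_reachable flag are all hypothetical names, and how the server would actually decide that pbs_mom is reachable is left out.

```python
from enum import Enum


class DeleteAction(Enum):
    """Hypothetical outcomes for a delete request (not TORQUE identifiers)."""
    PURGE = "purge"            # forcibly drop the job from the server's records
    NORMAL_DELETE = "delete"   # ordinary qdel path that goes through pbs_mom


def choose_delete_action(purge_requested, mom_reachable):
    """Return (action, warning) for a delete request.

    Models only the policy proposed above: honor -p when pbs_mom cannot
    be reached, otherwise fall back to a plain delete and warn.
    """
    if purge_requested and mom_reachable:
        # pbs_mom still has the job, so ignore -p and tell the user why.
        return DeleteAction.NORMAL_DELETE, "Ignoring -p: pbs_mom is reachable"
    if purge_requested:
        # Nodes are unreachable: this is the case -p is meant for.
        return DeleteAction.PURGE, None
    return DeleteAction.NORMAL_DELETE, None
```

With something like this, a careless qdel -p against a healthy node would degrade into a normal delete plus a warning, instead of leaving pbs_server with an inconsistent job list.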