[torquedev] pbs_server segfault in req_delete.c

Joshua Bernstein jbernstein at penguincomputing.com
Mon Dec 29 12:02:33 MST 2008



Garrick Staples wrote:
> On Tue, Dec 23, 2008 at 03:47:38PM -0800, Joshua Bernstein alleged:
>> Hello TORQUE Fans!
>>
>> 	Remember me? I figured I'd drop one more observed and repeatable 
>> segfault before we all went on a break for the holidays. This time 
>> though it seems to be inside of pbs_server. I'm running on X86_64, and 
>> I've been able to reproduce this problem in both version 2.3.3 and the 
>> brand shiny new 2.3.6.
>>
>> 	Essentially, if you issue a qdel -p to clear the queue of stale 
>> jobs, pbs_server appears to continue to operate normally, but shortly after 
>> new jobs get submitted to the queue, pbs_server posts this message and dies.
>>
>> Assertion failed, bad pointer in link: file "req_delete.c", line 844
>> Aborted (core dumped)
> 
> While segfaults need to always be fixed, you are using qdel -p incorrectly.  It
> should only be used if a running job will not exit because its allocated nodes
> are unreachable.

Well, if I'm using it incorrectly, then I think qdel should error out 
with a message like:

"Ignoring request: pbs_mom is reachable"

The system shouldn't let me do something that isn't allowed, or isn't 
"proper".

> qdel -p is a very bad thing to do.  It is intentionally breaking pbs_server's
> idea of what is going on.  
> 
> Since you are using qdel -p when you have a running pbs_mom that has the job,
> you are bound to have bad things happen.

I recognize that. But customers, and other less experienced people who 
manage TORQUE systems, don't.

Still, the segfault should be fixed, or at least prevented by making 
qdel -p aware that it's the wrong thing to do when pbs_mom is available. 
Perhaps instead of just printing a message, it could simply ignore the 
-p option, print a warning, and issue a plain qdel.
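
To make the suggestion concrete, here is a rough sketch of the decision 
I have in mind. It is only an illustration, not a patch: every name in 
it (delete_mode, choose_delete_mode, the mom_reachable flag) is made up 
for this example and is not taken from the TORQUE source, and how the 
reachability of pbs_mom would actually be checked is left open.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical names throughout; none of these are TORQUE identifiers. */
enum delete_mode { DELETE_NORMAL, DELETE_PURGE };

/*
 * Decide how to honor a delete request.
 *
 * purge_requested: the user passed -p.
 * mom_reachable:   result of some liveness check on the job's pbs_mom
 *                  (assumed to exist; not shown here).
 * warn:            stream for the warning, or NULL to stay silent.
 */
enum delete_mode choose_delete_mode(bool purge_requested,
                                    bool mom_reachable,
                                    FILE *warn)
{
    if (purge_requested && mom_reachable) {
        /* Purging would desynchronize the server from a live pbs_mom,
         * so downgrade to an ordinary delete and say so. */
        if (warn != NULL)
            fprintf(warn, "Ignoring -p: pbs_mom is reachable, "
                          "doing a normal delete\n");
        return DELETE_NORMAL;
    }

    /* Honor -p only when the pbs_mom cannot be reached. */
    return purge_requested ? DELETE_PURGE : DELETE_NORMAL;
}
```

With something like this, -p would only take effect in the case it was 
meant for (unreachable nodes), and everyone else would get a warning 
plus the normal delete.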

-Joshua Bernstein
Software Engineer
Penguin Computing

