[torquedev] pbs_server segfault in req_delete.c

Joshua Bernstein jbernstein at penguincomputing.com
Mon Dec 29 12:08:42 MST 2008



Garrick Staples wrote:
> On Tue, Dec 23, 2008 at 03:47:38PM -0800, Joshua Bernstein alleged:
>> I've been able to reproduce this by submitting jobs (a simple echo 
>> "HELLO") out of a directory that isn't known to pbs_mom. (ie: something 
>> not listed in mom_priv/config). In my case I just use /tmp on the 
>> headnode. This causes the job to enter the "E", or exiting state and 
>> thus hang out in the queue until the remote copy times out. At this 
> 
> Why is the remote copy hanging?  You have scp setup for the users, right?  Do
> you have port filtering dropping ssh packets from the nodes?  My users do this
> exact same thing routinely without a problem.

As I mentioned before, I've purposely broken the remote copy mechanism 
in order to recreate a scenario several customers were facing. While 
most people were happy to fix the staging issue, some still wanted to 
know why pbs_server would segfault at all, even when something was done 
incorrectly, and often right away. Thus I think it's important to fix 
this, either by fixing up pbs_server's code, or by modifying qdel to 
prevent it from doing the "wrong" thing while the pbs_moms are still 
available.

My guess, from looking at the code, is that the "stuck jobs" are left in 
a weird state as far as the pbs_mom is concerned. The job is partly 
removed from the offending pbs_mom, but incompletely removed on 
pbs_server, so this line in req_delete.c:

pwtiter = (struct work_task *)GET_NEXT(pjob->ji_svrtask);

segfaults on the half-dismantled queue element.
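
For illustration only, here is a tiny standalone C sketch (not TORQUE 
source; the names list_link_sim, work_task_sim, and get_next_sim are 
made up) that mimics the GET_NEXT()-over-ji_svrtask pattern. It shows 
how a half-dismantled list element can hand the caller a stale or NULL 
back-pointer, and how a NULL check keeps the caller from dereferencing 
it:

#include <stdio.h>
#include <stddef.h>

/* Simplified stand-ins for TORQUE's list_link / work_task structures. */
struct list_link_sim {
    struct list_link_sim *ll_next;
    void                 *ll_struct;  /* back-pointer to owning object */
};

struct work_task_sim {
    struct list_link_sim wt_linkall;
    int                  wt_type;
};

/* Analogue of GET_NEXT(): return the object owned by the next link,
 * or NULL if the link is empty or has been torn down. */
static void *get_next_sim(struct list_link_sim *link)
{
    return (link->ll_next != NULL) ? link->ll_next->ll_struct : NULL;
}

int main(void)
{
    struct work_task_sim task = { { NULL, NULL }, 7 };
    struct list_link_sim head = { &task.wt_linkall, NULL };

    task.wt_linkall.ll_struct = &task;

    /* Normal case: traversal finds the task. */
    struct work_task_sim *p = get_next_sim(&head);
    printf("found task of type %d\n", p ? p->wt_type : -1);

    /* Half-dismantled case: the back-pointer was cleared during
     * teardown but the head still references the link. Without the
     * NULL check, dereferencing p here is the segfault. */
    task.wt_linkall.ll_struct = NULL;
    p = get_next_sim(&head);
    if (p == NULL)
        printf("stale element skipped instead of dereferenced\n");

    return 0;
}

The real fix would of course live in pbs_server itself (or in qdel's 
purge logic), but the NULL guard above is the general shape of defense 
I have in mind.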

Like I said, maybe there's a fix there, or maybe some enhancement needs 
to be made to qdel -p's logic.

Is there a public bug tracker anywhere for TORQUE that we as a 
community can use to file and track these sorts of reports against 
future versions of TORQUE?

-Joshua Bernstein
Software Engineer
Penguin Computing
