[torquedev] pbs_server segfault in req_delete.c
Joshua Bernstein
jbernstein at penguincomputing.com
Mon Dec 29 12:08:42 MST 2008
Garrick Staples wrote:
> On Tue, Dec 23, 2008 at 03:47:38PM -0800, Joshua Bernstein alleged:
>> I've been able to reproduce this by submitting jobs (a simple echo
>> "HELLO") out of a directory that isn't known to pbs_mom. (ie: something
>> not listed in mom_priv/config). In my case I just use /tmp on the
>> headnode. This causes the job to enter the "E", or exiting state and
>> thus hang out in the queue until the remote copy times out. At this
>
> Why is the remote copy hanging? You have scp setup for the users, right? Do
> you have port filtering dropping ssh packets from the nodes? My users do this
> exact same thing routinely without a problem.
As I mentioned before, I've purposely broken the remote copy mechanism
in order to recreate a scenario several customers were facing. While
most people were happy to fix the staging issue, some still wanted to
know why pbs_server would segfault at all, often even right away, even
if something was configured incorrectly. Thus I think it's important to
fix this, either by fixing up pbs_server's code or by modifying qdel to
prevent it from doing the "wrong" thing while the pbs_moms are still
available.
My guess, from looking at the code, is that the "stuck jobs" are left in
a weird state insofar as the pbs_mom is concerned. The job is partly
removed from the offending pbs_mom, but incompletely removed on
pbs_server. So:
pwtiter = (struct work_task*)GET_NEXT(pjob->ji_svrtask); (req_delete.c)
gets a segfault on the half-dismantled queue element.
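To illustrate the kind of guard I have in mind, here is a minimal sketch of a defensive list walk. It does not use TORQUE's actual definitions: struct link, struct task, list_init, list_append, safe_next and count_tasks are all hypothetical stand-ins, and the assumption is a circular doubly linked list whose head has no owner. It can only catch NULL or inconsistent links, not memory that has already been freed.

```c
#include <stddef.h>

/* Hypothetical stand-in for a list link; NOT TORQUE's real definition. */
struct link {
    struct link *prev;
    struct link *next;
    void        *owner;  /* struct this link is embedded in; NULL for the head */
};

/* Hypothetical stand-in for a work task hanging off a job. */
struct task {
    struct link all_tasks;
    int         id;
};

/* Make an empty circular list: the head points at itself. */
static void list_init(struct link *head)
{
    head->prev  = head;
    head->next  = head;
    head->owner = NULL;
}

/* Append node (embedded in owner) at the tail of the list. */
static void list_append(struct link *head, struct link *node, void *owner)
{
    node->owner      = owner;
    node->next       = head;
    node->prev       = head->prev;
    head->prev->next = node;
    head->prev       = node;
}

/*
 * Return the owner of the element after pos, or NULL when the end of the
 * list is reached OR the links look half-dismantled (a NULL pointer, or a
 * neighbour whose back pointer no longer points at pos).
 */
static void *safe_next(const struct link *pos)
{
    if (pos == NULL || pos->next == NULL)
        return NULL;
    if (pos->next->prev != pos)
        return NULL;
    return pos->next->owner;
}

/* Count the tasks reachable from head, stopping at the first bad link. */
static int count_tasks(const struct link *head)
{
    int          n = 0;
    struct task *t = safe_next(head);

    while (t != NULL) {
        n++;
        t = safe_next(&t->all_tasks);
    }
    return n;
}
```

Stopping at the first inconsistent link trades completeness for not crashing: the walk may miss tasks behind the damaged element, so the caller would still want to log the condition.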
Like I said, maybe there's a fix there, or maybe some enhancement needs
to be done to qdel -p's logic.
Is there a public bug tracker anywhere for TORQUE that we, as the
community, can use to file and track these sorts of issues against
future versions of TORQUE?
-Joshua Bernstein
Software Engineer
Penguin Computing