[torquedev] pbs_server segfault in req_delete.c
jbernstein at penguincomputing.com
Mon Dec 29 12:26:02 MST 2008
Glen Beane wrote:
> On Mon, Dec 29, 2008 at 2:08 PM, Joshua Bernstein
> <jbernstein at penguincomputing.com> wrote:
>> Garrick Staples wrote:
>>> On Tue, Dec 23, 2008 at 03:47:38PM -0800, Joshua Bernstein alleged:
>>>> I've been able to reproduce this by submitting jobs (a simple echo
>>>> "HELLO") out of a directory that isn't known to pbs_mom. (ie: something not
>>>> listed in mom_priv/config). In my case I just use /tmp on the headnode. This
>>>> causes the job to enter the "E", or exiting state and thus hang out in the
>>>> queue until the remote copy times out. At this
>>> Why is the remote copy hanging? You have scp set up for the users, right? Do
>>> you have port filtering dropping ssh packets from the nodes? My users do the
>>> exact same thing routinely without a problem.
>> As I mentioned before, I've purposely broken the remote copy mechanism in
>> order to recreate a scenario several customers were facing. While most
>> people were happy to fix the staging issue, some still wanted to know why at
>> any point, even if something was done incorrectly, pbs_server would still
>> segfault, often even right away. Thus I think it's important to fix this,
>> either by fixing up pbs_server's code, or modifying qdel, to prevent it from
>> doing the "wrong" thing when pbs_mom are still available.
>> My guess, from looking at the code is that the "stuck jobs" are left in a
>> weird state insofar as the pbs_mom is concerned. The job is partly removed
>> from the offending pbs_mom, but incompletely removed on pbs_server. So:
>> pwtiter = (struct work_task*)GET_NEXT(pjob->ji_svrtask); (req_delete.c)
>> Gets a segfault on the half-dismantled queue element.
>> Like I said, maybe there's a fix there, or maybe some enhancement needs to
>> be done to qdel -p's logic.
>> Is there a public bug tracker anywhere for TORQUE, that we as the community
>> can use to file and track these sorts of release against future versions of
> there is a publicly available TORQUE bugzilla but it hasn't been
> maintained and all the bug entries are several years old now and
> logged against obsolete versions of TORQUE. Perhaps it is time to
> revive TORQUE Bugzilla.
YAY!!!! If that's a proposal, then I'd like to second it! If it's a matter
of resources, I'll even volunteer to help maintain it.
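
To make the suspected failure mode concrete, here is a minimal C sketch of
walking a job's task list when the list may already be half torn down. It is
only an illustration under stated assumptions: struct link, next_or_null() and
count_tasks() are hypothetical stand-ins written for this example, not TORQUE's
actual list code or the real GET_NEXT macro, and the "torn down" state is
modeled simply as NULL link pointers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an intrusive, circular, doubly linked list.
 * The list head has owner == NULL, which marks the end of a walk. */
struct link {
    struct link *prior;
    struct link *next;
    void        *owner;
};

struct task { struct link all; int id; };
struct job  { struct link svrtask; };   /* loosely mirrors pjob->ji_svrtask */

static void list_init(struct link *head)
{
    head->prior = head;
    head->next  = head;
    head->owner = NULL;
}

static void list_append(struct link *head, struct link *node, void *owner)
{
    assert(head != NULL && head->prior != NULL);
    node->owner = owner;
    node->next  = head;
    node->prior = head->prior;
    head->prior->next = node;
    head->prior = node;
}

/* Guarded lookup: returns the owner of the next element, or NULL at the
 * end of the list or when the links have already been cleared. An
 * unguarded version would dereference l->next unconditionally and crash
 * on a cleared link. */
static void *next_or_null(const struct link *l)
{
    if (l == NULL || l->next == NULL)
        return NULL;
    return l->next->owner;
}

/* Walks the job's task list without assuming the links are still intact. */
static int count_tasks(const struct job *pjob)
{
    int n = 0;
    const struct task *t = next_or_null(&pjob->svrtask);

    while (t != NULL) {
        n++;
        t = next_or_null(&t->all);
    }
    return n;
}
```

With two tasks appended, count_tasks() visits both. If the head's links are
then set to NULL to imitate a partly dismantled job, the guarded walk stops
cleanly instead of dereferencing a bad pointer. Whether a guard like this or a
change to the qdel -p logic is the right fix for pbs_server is still an open
question.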