[torquedev] pbs_server segfault in req_delete.c

Glen Beane glen.beane at gmail.com
Mon Dec 29 12:30:49 MST 2008


On Mon, Dec 29, 2008 at 2:26 PM, Joshua Bernstein
<jbernstein at penguincomputing.com> wrote:
>
>
> Glen Beane wrote:
>>
>> On Mon, Dec 29, 2008 at 2:08 PM, Joshua Bernstein
>> <jbernstein at penguincomputing.com> wrote:
>>>
>>> Garrick Staples wrote:
>>>>
>>>> On Tue, Dec 23, 2008 at 03:47:38PM -0800, Joshua Bernstein alleged:
>>>>>
>>>>> I've been able to reproduce this by submitting jobs (a simple echo
>>>>> "HELLO") out of a directory that isn't known to pbs_mom (i.e.,
>>>>> something not listed in mom_priv/config). In my case I just use /tmp
>>>>> on the headnode. This causes the job to enter the "E", or exiting,
>>>>> state and thus hang out in the queue until the remote copy times out.
>>>>> At this
>>>>
>>>> Why is the remote copy hanging?  You have scp set up for the users,
>>>> right?  Do you have port filtering dropping ssh packets from the nodes?
>>>> My users do this exact same thing routinely without a problem.
>>>
>>> As I mentioned before, I've purposely broken the remote copy mechanism in
>>> order to recreate a scenario several customers were facing. While most
>>> people were happy to fix the staging issue, some still wanted to know why
>>> pbs_server would segfault at all, often right away, even if something
>>> was done incorrectly. Thus I think it's important to fix this, either by
>>> fixing up pbs_server's code or by modifying qdel, to prevent it from
>>> doing the "wrong" thing when the pbs_moms are still available.
>>>
>>> My guess, from looking at the code, is that the "stuck jobs" are left in
>>> a weird state insofar as the pbs_mom is concerned. The job is partly
>>> removed from the offending pbs_mom, but incompletely removed on
>>> pbs_server, so:
>>>
>>> pwtiter = (struct work_task*)GET_NEXT(pjob->ji_svrtask); (req_delete.c)
>>>
>>> gets a segfault on the half-dismantled queue element.
>>>
>>> Like I said, maybe there's a fix there, or maybe some enhancement needs
>>> to be done to qdel -p's logic.
>>>
>>> Is there a public bug tracker anywhere for TORQUE that we as the
>>> community can use to file and track these sorts of issues against
>>> future versions of TORQUE?
>>
>> There is a publicly available TORQUE Bugzilla, but it hasn't been
>> maintained, and all the bug entries are several years old now and
>> logged against obsolete versions of TORQUE.  Perhaps it is time to
>> revive TORQUE Bugzilla.
>
> YAY!!!! If that's a proposal, then I'd like to second it! If it's a matter
> of resources, I'll even volunteer to help maintain it.


What do you think, Josh (CRI Josh, not Penguin Josh!)?  I think we
would want to upgrade the Bugzilla at
www.clusterresources.com/bugzilla and probably flush out all the old
bugs and start fresh, since TORQUE has changed so much since they were
logged.
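
On the GET_NEXT crash quoted above: one general way to make that kind of
traversal survive a half-dismantled element is to clear a link's pointers
when it is unlinked and to check them before following them. The following
is only a minimal sketch of that idea. The struct and function names
(struct link, list_unlink, safe_next) are hypothetical stand-ins, not
TORQUE's actual list code or the GET_NEXT macro, and whether this matches
the real cause of the segfault in req_delete.c is an assumption.

```c
#include <stddef.h>

/* Hypothetical stand-in for an intrusive list link; NOT TORQUE's real type. */
struct link {
    struct link *prior;
    struct link *next;
    void        *owner;  /* struct that embeds this link; NULL for the head */
};

/* Make head an empty circular list. */
static void list_init(struct link *head)
{
    head->prior = head;
    head->next  = head;
    head->owner = NULL;
}

/* Append entry (embedded in owner) at the tail of the list. */
static void list_append(struct link *head, struct link *entry, void *owner)
{
    entry->owner = owner;
    entry->prior = head->prior;
    entry->next  = head;
    head->prior->next = entry;
    head->prior = entry;
}

/* Unlink entry and clear its pointers so a later traversal from it
 * sees NULL instead of following stale pointers. */
static void list_unlink(struct link *entry)
{
    if (entry->prior != NULL && entry->next != NULL) {
        entry->prior->next = entry->next;
        entry->next->prior = entry->prior;
    }
    entry->prior = NULL;
    entry->next  = NULL;
}

/* Return the owner of the element after pos, or NULL at the end of the
 * list or when pos has already been dismantled. */
static void *safe_next(const struct link *pos)
{
    if (pos == NULL || pos->next == NULL)
        return NULL;
    return pos->next->owner;
}
```

With something like this, a delete loop that steps with safe_next() would
stop at an element that has already been unlinked rather than dereference
it. It would not help if the element's memory had already been freed.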

