[torquedev] pbs_server segfault in req_delete.c

Glen Beane glen.beane at gmail.com
Mon Dec 29 12:21:39 MST 2008


On Mon, Dec 29, 2008 at 2:08 PM, Joshua Bernstein
<jbernstein at penguincomputing.com> wrote:
>
>
> Garrick Staples wrote:
>>
>> On Tue, Dec 23, 2008 at 03:47:38PM -0800, Joshua Bernstein alleged:
>>>
>>> I've been able to reproduce this by submitting jobs (a simple echo
>>> "HELLO") out of a directory that isn't known to pbs_mom (i.e., something not
>>> listed in mom_priv/config). In my case I just use /tmp on the headnode. This
>>> causes the job to enter the "E", or exiting state and thus hang out in the
>>> queue until the remote copy times out. At this
>>
>> Why is the remote copy hanging?  You have scp set up for the users, right?
>> Do you have port filtering dropping ssh packets from the nodes?  My users do
>> this exact same thing routinely without a problem.
>
> As I mentioned before, I've purposely broken the remote copy mechanism in
> order to recreate a scenario several customers were facing. While most
> people were happy to fix the staging issue, some still wanted to know why at
> any point, even if something was done incorrectly, pbs_server would still
> segfault, often right away. Thus I think it's important to fix this,
> either by fixing up pbs_server's code, or modifying qdel, to prevent it from
> doing the "wrong" thing when pbs_mom are still available.
>
> My guess, from looking at the code is that the "stuck jobs" are left in a
> weird state insofar as the pbs_mom is concerned. The job is partly removed
> from the offending pbs_mom, but incompletely removed on pbs_server, so:
>
> pwtiter = (struct work_task*)GET_NEXT(pjob->ji_svrtask); (req_delete.c)
>
> Gets a segfault on the half-dismantled queue element.
>
> Like I said, maybe there's a fix there, or maybe some enhancement needs to
> be made to qdel -p's logic.
>
> Is there a public bug tracker anywhere for TORQUE, that we as the community
> can use to file and track these sorts of issues against future versions of
> TORQUE?

There is a publicly available TORQUE Bugzilla, but it hasn't been
maintained: all of the bug entries are several years old now and
logged against obsolete versions of TORQUE.  Perhaps it is time to
revive the TORQUE Bugzilla.
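
On the crash itself: the GET_NEXT line quoted above walks a linked list
hanging off the job, so one possible server-side mitigation is to check the
links before following them. Below is a minimal sketch of that idea only.
The names struct link, link_init, link_append and safe_next are hypothetical
stand-ins invented for illustration; they are not TORQUE's actual list
structure or API. It also assumes the half-dismantled element still lives in
valid memory, since a check like this cannot detect a pointer to memory that
has already been freed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an embedded list link (not TORQUE's definition). */
struct link {
    struct link *prev;
    struct link *next;
    void        *owner;  /* struct embedding this link; NULL for the list head */
};

/* Make a link point at itself, i.e. an empty circular list. */
static void link_init(struct link *l, void *owner)
{
    l->prev = l;
    l->next = l;
    l->owner = owner;
}

/* Append l at the tail of the circular list anchored at head. */
static void link_append(struct link *head, struct link *l)
{
    l->prev = head->prev;
    l->next = head;
    head->prev->next = l;
    head->prev = l;
}

/*
 * Return the owner of the element after cur, or NULL when the end of the
 * list is reached or the links look inconsistent (for example an element
 * that was only partly unlinked).
 */
static void *safe_next(const struct link *head, const struct link *cur)
{
    const struct link *n;

    assert(head != NULL);
    if (cur == NULL || cur->next == NULL)
        return NULL;              /* nothing to follow */
    n = cur->next;
    if (n == head)
        return NULL;              /* wrapped around: end of list */
    if (n->prev != cur)
        return NULL;              /* back-pointer mismatch: stop walking */
    return n->owner;
}
```

A caller would loop on safe_next() and treat NULL as "stop", logging the
inconsistent case rather than dereferencing further. Whether that belongs in
pbs_server or the fix should be in qdel's logic is the open question above.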

