[torquedev] failing job init is fubar

Garrick Staples garrick at clusterresources.com
Wed Mar 7 16:11:49 MST 2007


Turns out that any failure to initialize a job when pbs_server is
restarting is entirely mishandled and generally causes it to segfault.

An easy way to trigger this is to create a temp execution queue, submit a
job to that queue, stop pbs_server, remove the queue state file, and
start pbs_server again.  Trying to re-enqueue the job into a nonexistent
queue fails the job init.

pbsd_init_job() 
  -> pbsd_init_reque() 
       -> svr_enquejob()
            <- returns PBSE_UNKQUE
       -> job_abt()
            <- returns after completely free()ing the job struct
       <- returns void
  pbsd_init_job() has no idea anything went wrong, continues to access
  pjob, and blows up.
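
One possible shape for a fix is to have the requeue step report failure,
so the caller stops touching the job once it has been freed.  The sketch
below is a self-contained illustration of that pattern only; it is not
TORQUE code.  Every sketch_* name, the struct layout, the "batch" queue,
and the SKETCH_UNKQUE value are made up here, standing in for
pbsd_init_job(), pbsd_init_reque(), svr_enquejob(), job_abt(), and
PBSE_UNKQUE.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_OK     0
#define SKETCH_UNKQUE 1  /* hypothetical stand-in for PBSE_UNKQUE */

/* Hypothetical, heavily reduced job structure. */
struct sketch_job {
    char queue[32];
    int  state;
};

/* Stand-in for svr_enquejob(): in this sketch only "batch" exists. */
static int sketch_enquejob(struct sketch_job *pjob)
{
    assert(pjob != NULL);
    return strcmp(pjob->queue, "batch") == 0 ? SKETCH_OK : SKETCH_UNKQUE;
}

/* Stand-in for job_abt(): frees the job and clears the caller's pointer. */
static void sketch_job_abt(struct sketch_job **ppjob)
{
    free(*ppjob);
    *ppjob = NULL;
}

/* Stand-in for pbsd_init_reque(), returning a status instead of void. */
static int sketch_init_reque(struct sketch_job **ppjob)
{
    int rc = sketch_enquejob(*ppjob);

    if (rc != SKETCH_OK)
        sketch_job_abt(ppjob);  /* the job no longer exists after this */

    return rc;
}

/* Stand-in for pbsd_init_job(): 0 if requeued, -1 if the job was dropped. */
int sketch_init_job(const char *queue_name)
{
    struct sketch_job *pjob = calloc(1, sizeof(*pjob));

    if (pjob == NULL)
        return -1;

    /* calloc() zeroed the buffer, so the copy stays NUL-terminated. */
    strncpy(pjob->queue, queue_name, sizeof(pjob->queue) - 1);

    if (sketch_init_reque(&pjob) != SKETCH_OK)
        return -1;              /* bail out: pjob has been freed */

    pjob->state = 1;            /* safe: the job still exists here */
    free(pjob);
    return 0;
}
```

With this shape, sketch_init_job("batch") returns 0, while a job pointing
at a missing queue returns -1 without the freed struct being touched
again.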

So, um, this might be fun for someone else to fix :)


