[torqueusers] pbs_server segfaults

Dave Jackson jacksond at supercluster.org
Wed Nov 3 14:26:45 MST 2004


Jason,

  If one of a job's dependencies becomes unreachable, this routine can
abort the job and clear the job pointer.  Unfortunately, nothing at the
end of the routine checked whether this had occurred before the final
job processing ran, so the cleared pointer was dereferenced.  This is
now corrected in the torque-1.1.0p5 snapshot.  The corrected code is
also listed below:

  Thanks for bringing this to our attention!

Dave

# req_register.c
-------

  if (rc)
    {
    if (pjob != NULL)
      pjob->ji_modified = 0;

    req_reject(rc,0,preq,NULL,NULL);
    }
  else
    {
    if ((pjob != NULL) && (pjob->ji_modified != 0))
      job_save(pjob,SAVEJOB_FULL);

    reply_ack(preq);
    }
-------
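
For anyone who wants to see the guard in isolation, here is a minimal
sketch of the same pattern.  It is not the TORQUE source: struct job is
cut down to one field, and save_job(), finish_request() and the saves
counter are hypothetical stand-ins for job_save() and the tail of
req_register(), with the return value standing in for the
req_reject()/reply_ack() calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for the server's job structure. */
struct job {
    int ji_modified;
};

/* Counts calls to save_job() so the behaviour can be observed. */
static int saves = 0;

/* Hypothetical stand-in for job_save(); it must never see NULL. */
static void save_job(struct job *pjob)
{
    assert(pjob != NULL);
    saves++;
}

/*
 * Mimics the end of the routine.  By this point pjob may already have
 * been cleared (NULL) because the job was aborted earlier on.
 * Returns -1 where the real code rejects the request, 0 where it acks.
 */
static int finish_request(struct job *pjob, int rc)
{
    if (rc) {
        if (pjob != NULL)
            pjob->ji_modified = 0;   /* discard pending changes */
        return -1;
    }

    /* Only touch the job if it still exists and was modified. */
    if (pjob != NULL && pjob->ji_modified != 0)
        save_job(pjob);

    return 0;
}
```

Calling finish_request(NULL, 0) returns 0 without saving anything,
which is the case that used to crash at "if (pjob->ji_modified)".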
On Wed, 2004-11-03 at 11:27, Jason Allen wrote:
> We just upgraded our 225 node cluster from torque-1.1.0p1 to
> torque-1.1.0p4 and we are now seeing the pbs_server process crash
> intermittently. We currently have about 300 jobs in the queue and the
> server dies every 2 - 30 mins.
> 
> After running pbs_server in gdb it looks like there is a problem
> handling job requests.
>  
> Program received signal SIGSEGV, Segmentation fault.
> 0x0805f204 in req_register (preq=0x96fd0c8) at req_register.c:498
> 498         if (pjob->ji_modified)
> (gdb) quit
> 
> 
> Has anyone else seen this? 
> 
> Thanks!
> 
> Jason Allen
> Fermilab 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://supercluster.org/mailman/listinfo/torqueusers


