[torqueusers] segfault loop

David Jackson jacksond at clusterresources.com
Thu Apr 21 10:29:59 MDT 2005


Garrick,

  Were you able to reproduce this or track it down?

Dave

On Tue, 2005-04-12 at 16:52 -0700, Garrick Staples wrote:
> I've got a new bug here.  When starting a job, the child process that is
> about to run prologue segfaults.  pbs_server keeps retrying the job start and
> the prologue child keeps segfaulting.  Eventually the job gets assigned to
> another node and starts up fine.
> 
> So the loop is: pbs_server sends the job to MS (the mother superior pbs_mom),
> transfers the files, and tells MS to start the job.  MS forks to run prologue.
> The child segfaults.  MS sends the failure back to pbs_server.  pbs_server
> tells MS to delete the job, and it starts over at the beginning.
> 
> This happens to several nodes at once.  A quick pbs_mom restart fixes the
> problem.  Another job, or a job from another user, can start fine.  So there is
> something about the initial job prep in MS that is failing and is cached across
> job starts... and it must be triggered from a shared resource, like DNS or NFS.
> 
> 
> looping mom logs...
> 04/12/2005 15:57:52;0001;   pbs_mom;Job;225877.hpc-pbs.usc.edu;phase 2 of job launch successfully completed
> 04/12/2005 15:57:52;0001;   pbs_mom;Job;225877.hpc-pbs.usc.edu;job not ready after 0 second timeout, MOM will recheck
> 04/12/2005 15:57:52;0008;   pbs_mom;Job;scan_for_terminated;pid 1573 not tracked, exitcode=11
> 04/12/2005 15:57:52;0001;   pbs_mom;Svr;pbs_mom;No child processes (10) in TMomFinalizeJob3, read of pipe for sid failed for job 225877.hpc-pbs.usc.edu (0 of 8 bytes)
> 04/12/2005 15:57:52;0001;   pbs_mom;Job;TMomFinalizeJob3;start failed, improper sid
> 04/12/2005 15:57:52;0001;   pbs_mom;Job;225877.hpc-pbs.usc.edu;ALERT:  job failed phase 3 start, server will retry
> 
> strace output...
> 24044 close(11)                         = 0
> 24044 munlockall()                      = 0
> 24044 close(16)                         = 0
> 24044 close(15)                         = 0
> 24044 --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> 
> 
> The munlockall() call is in fork_me() being called from TMomFinalizeJob2().
> The last 2 close() calls must be in TMomFinalizeChild().  Perhaps it is
> segfaulting inside of set_shell()?
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://supercluster.org/mailman/listinfo/torqueusers
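
For anyone who wants to check their own nodes for the same loop, here is a
rough sketch of a scanner for the mom log lines quoted above.  It is not part
of TORQUE.  The field layout (timestamp;code;daemon;class;object;message) is
inferred only from the quoted excerpt, the function names are made up for
illustration, and it assumes, as the report does, that "exitcode=11" is a
signal number (11 is SIGSEGV on Linux).

```python
import re
import signal
from collections import Counter

# Assumed layout, inferred from the quoted excerpt only:
#   MM/DD/YYYY HH:MM:SS;code;daemon;class;object;message
LINE_RE = re.compile(
    r"^(?P<stamp>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2});"
    r"(?P<code>[0-9a-fA-F]{4});\s*(?P<daemon>[^;]+);"
    r"(?P<cls>[^;]+);(?P<obj>[^;]+);(?P<msg>.*)$"
)
UNTRACKED_RE = re.compile(r"pid (\d+) not tracked, exitcode=(\d+)")


def parse_line(line):
    """Return the log fields as a dict, or None if the line does not match."""
    # Drop any mail quote prefix ("> ") and the trailing newline.
    match = LINE_RE.match(line.lstrip("> ").rstrip("\n"))
    return match.groupdict() if match else None


def summarize(lines):
    """Count phase 3 start failures per job and list untracked child exits."""
    retries = Counter()
    crashes = []
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        if "job failed phase 3 start" in rec["msg"]:
            retries[rec["obj"]] += 1
        hit = UNTRACKED_RE.search(rec["msg"])
        if hit:
            pid, code = int(hit.group(1)), int(hit.group(2))
            try:
                # Assumption: the logged value is a signal number.
                name = signal.Signals(code).name
            except ValueError:
                name = str(code)
            crashes.append((pid, name))
    return retries, crashes


# Two lines copied from the excerpt above, one still carrying its quote prefix.
SAMPLE = [
    "> 04/12/2005 15:57:52;0008;   pbs_mom;Job;scan_for_terminated;"
    "pid 1573 not tracked, exitcode=11",
    "04/12/2005 15:57:52;0001;   pbs_mom;Job;225877.hpc-pbs.usc.edu;"
    "ALERT:  job failed phase 3 start, server will retry",
]

if __name__ == "__main__":
    retries, crashes = summarize(SAMPLE)
    for job, count in retries.items():
        print(job, "phase 3 start failures:", count)
    for pid, name in crashes:
        print("untracked child", pid, "exited with", name)
```

A job id that keeps accumulating phase 3 failures on one node, alongside
untracked children dying with SIGSEGV, would match the loop described above.
To scan a real log, pass an open file object to summarize() in place of
SAMPLE.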


