[torqueusers] segfault loop

Garrick Staples garrick at usc.edu
Fri Apr 22 10:28:52 MDT 2005


On Thu, Apr 21, 2005 at 10:29:59AM -0600, David Jackson alleged:
> Garrick,
> 
>   Were you able to reproduce this or track this down?

Nope.  I've had 4 such "storms", each with a few dozen jobs getting into this
loop at once, but I haven't seen any since I reported the problem.

Also, this happened in the first days after expanding the cluster to 1716
nodes.  But it's been otherwise performing beautifully.  The recent job polling
changes work great with 1000 running jobs!


> On Tue, 2005-04-12 at 16:52 -0700, Garrick Staples wrote:
> > I've got a new bug here.  When starting a job, the child process that is
> > about to run prologue segfaults.  pbs_server continues to start the job and the
> > prologue child continues to segfault.  Eventually the job gets assigned to
> > another node and starts up fine.
> > 
> > So the loop is: pbs_server sends the job to MS, transfers the files, and tells
> > MS to start the job.  MS forks to run prologue.  The child segfaults.  MS sends
> > the failure back to pbs_server.  pbs_server tells MS to delete the job, and it
> > starts over at the beginning.
> > 
> > This happens to several nodes at once.  A quick pbs_mom restart fixes the
> > problem.  Another job, or a job from another user, can start fine.  So there is
> > something about the initial job prep in MS that is failing and is cached across
> > job starts... and it must be triggered from a shared resource, like DNS or NFS.
> > 
> > 
> > looping mom logs...
> > 04/12/2005 15:57:52;0001;   pbs_mom;Job;225877.hpc-pbs.usc.edu;phase 2 of job launch successfully completed
> > 04/12/2005 15:57:52;0001;   pbs_mom;Job;225877.hpc-pbs.usc.edu;job not ready after 0 second timeout, MOM will recheck
> > 04/12/2005 15:57:52;0008;   pbs_mom;Job;scan_for_terminated;pid 1573 not tracked, exitcode=11
> > 04/12/2005 15:57:52;0001;   pbs_mom;Svr;pbs_mom;No child processes (10) in TMomFinalizeJob3, read of pipe for sid failed for job 225877.hpc-pbs.usc.edu (0 of 8 bytes)
> > 04/12/2005 15:57:52;0001;   pbs_mom;Job;TMomFinalizeJob3;start failed, improper sid
> > 04/12/2005 15:57:52;0001;   pbs_mom;Job;225877.hpc-pbs.usc.edu;ALERT:  job failed phase 3 start, server will retry
> > 
> > strace output...
> > 24044 close(11)                         = 0
> > 24044 munlockall()                      = 0
> > 24044 close(16)                         = 0
> > 24044 close(15)                         = 0
> > 24044 --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> > 
> > 
> > The munlockall() call is in fork_me() being called from TMomFinalizeJob2().
> > The last 2 close() calls must be in TMomFinalizeChild().  Perhaps it is
> > segfaulting inside of set_shell()?
> > 
> 
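If another storm shows up, something like the following could spot the looping
jobs straight from the mom logs.  This is only a rough sketch, not anything
that ships with TORQUE.  It assumes the log records are laid out as in the
excerpt quoted above (six ';'-separated fields with the message last), and it
assumes the "exitcode" that scan_for_terminated logs is a raw signal number,
which matches the SIGSEGV (11) in the strace.  The names scan_mom_log,
describe_exit and threshold are made up for illustration.

```python
import re
import signal
from collections import Counter

# Message text taken from the log excerpt quoted above.
RETRY_MARK = "job failed phase 3 start, server will retry"
EXIT_RE = re.compile(r"pid (\d+) not tracked, exitcode=(\d+)")


def describe_exit(code):
    """Map a logged exit code to a signal name.

    Assumption: the logged value is a raw signal number (11 is SIGSEGV on Linux).
    """
    try:
        return signal.Signals(code).name
    except ValueError:
        return "unknown"


def scan_mom_log(lines, threshold=3):
    """Return (looping_jobs, child_exits) for an iterable of pbs_mom log lines.

    looping_jobs maps job id -> number of phase 3 start failures, keeping only
    jobs that failed at least `threshold` times (a hypothetical cutoff).
    child_exits lists (pid, exitcode, signal name) for untracked children.
    """
    retries = Counter()
    child_exits = []
    for line in lines:
        # Assumed layout: timestamp;event;daemon;class;object id;message
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) < 6:
            continue  # skip anything that is not a regular log record
        obj_id, message = fields[4], fields[5]
        if RETRY_MARK in message:
            retries[obj_id] += 1
        match = EXIT_RE.search(message)
        if match:
            pid, code = int(match.group(1)), int(match.group(2))
            child_exits.append((pid, code, describe_exit(code)))
    looping = {job: n for job, n in retries.items() if n >= threshold}
    return looping, child_exits
```

Feeding it the lines of one node's mom log, e.g. scan_mom_log(open(path)),
would give the job ids that keep failing phase 3 on that node plus the pids
that died on a signal.  Since a pbs_mom restart clears the condition, a
non-empty result for a node would be the cue to restart that mom.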

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California