[torqueusers] segfault loop

Garrick Staples garrick at usc.edu
Tue Apr 12 17:52:32 MDT 2005


I've got a new bug here.  When starting a job, the child process that is
about to run the prologue segfaults.  pbs_server keeps retrying the job and the
prologue child keeps segfaulting.  Eventually the job gets assigned to another
node and starts up fine.

So the loop is: pbs_server sends the job to MS (the Mother Superior pbs_mom),
transfers the files, and tells MS to start the job.  MS forks to run prologue.
The child segfaults.  MS sends the failure back to pbs_server.  pbs_server tells
MS to delete the job, and the whole cycle starts over at the beginning.

This happens on several nodes at once.  A quick pbs_mom restart fixes the
problem.  Another job, or a job from another user, can start fine.  So something
in the initial job prep in MS is failing and the bad state is cached across job
starts... and it must be triggered by a shared resource, like DNS or NFS.


looping mom logs...
04/12/2005 15:57:52;0001;   pbs_mom;Job;225877.hpc-pbs.usc.edu;phase 2 of job launch successfully completed
04/12/2005 15:57:52;0001;   pbs_mom;Job;225877.hpc-pbs.usc.edu;job not ready after 0 second timeout, MOM will recheck
04/12/2005 15:57:52;0008;   pbs_mom;Job;scan_for_terminated;pid 1573 not tracked, exitcode=11
04/12/2005 15:57:52;0001;   pbs_mom;Svr;pbs_mom;No child processes (10) in TMomFinalizeJob3, read of pipe for sid failed for job 225877.hpc-pbs.usc.edu (0 of 8 bytes)
04/12/2005 15:57:52;0001;   pbs_mom;Job;TMomFinalizeJob3;start failed, improper sid
04/12/2005 15:57:52;0001;   pbs_mom;Job;225877.hpc-pbs.usc.edu;ALERT:  job failed phase 3 start, server will retry
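
For anyone wanting to sift through a lot of these lines, here is a minimal
Python sketch that splits a mom log line into its semicolon-separated fields.
The field names (timestamp, event, daemon, obj_class, obj_name, message) are my
own labels, and reading the second field as hexadecimal is an assumption; only
the field order comes from the lines above.

```python
from typing import NamedTuple


class MomLogRecord(NamedTuple):
    # Field names are illustrative labels, not TORQUE identifiers.
    timestamp: str   # e.g. "04/12/2005 15:57:52"
    event: int       # second field, assumed to be a hex event code
    daemon: str      # e.g. "pbs_mom"
    obj_class: str   # e.g. "Job" or "Svr"
    obj_name: str    # job id or function name
    message: str     # free text; may itself contain ';'


def parse_mom_log(line: str) -> MomLogRecord:
    # Split on the first five ';' only, so the message stays intact.
    ts, event, daemon, obj_class, obj_name, message = (
        line.rstrip("\n").split(";", 5)
    )
    return MomLogRecord(
        ts, int(event, 16), daemon.strip(), obj_class, obj_name, message
    )
```

Filtering the parsed records on obj_name == "scan_for_terminated" would pull
out just the lines that report how a child ended.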

strace output...
24044 close(11)                         = 0
24044 munlockall()                      = 0
24044 close(16)                         = 0
24044 close(15)                         = 0
24044 --- SIGSEGV (Segmentation fault) @ 0 (0) ---
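
To pick the same pattern out of a longer strace capture, something like the
following Python sketch could help.  It records, per pid, the last few syscall
names seen before a signal line.  The regexes are written against the line
layout shown above only and are not a general strace parser; the function name
and the keep parameter are made up for this example.

```python
import re

# "PID name(args) = result", as in the lines above.
SYSCALL = re.compile(r"^(\d+)\s+(\w+)\((.*)\)\s+=\s+(.+)$")
# "PID --- SIGNAME ..." marks a delivered signal.
SIGNAL = re.compile(r"^(\d+)\s+--- (SIG\w+)")


def calls_before_signal(lines, keep=4):
    """Map (pid, signal) to the last `keep` syscall names before the signal."""
    recent = {}   # pid -> syscall names seen so far
    result = {}
    for line in lines:
        m = SIGNAL.match(line)
        if m:
            pid, sig = m.groups()
            result[(pid, sig)] = recent.get(pid, [])[-keep:]
            continue
        m = SYSCALL.match(line)
        if m:
            recent.setdefault(m.group(1), []).append(m.group(2))
    return result
```

Fed the five lines above, this maps ("24044", "SIGSEGV") to
["close", "munlockall", "close", "close"].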


The munlockall() call is in fork_me(), which is called from TMomFinalizeJob2().
The last 2 close() calls must be in TMomFinalizeChild().  Perhaps it is
segfaulting inside of set_shell()?
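
To show why a child that dies this early would produce both the "0 of 8 bytes"
pipe read and exitcode=11, here is a small Python sketch of the general
fork-and-pipe pattern.  It is not the pbs_mom code: launch_child() is a made-up
name, the crash is simulated by the child sending itself SIGSEGV, and packing
the session id as 8 bytes is an assumption based on the log message.

```python
import os
import resource
import signal
import struct


def launch_child(crash):
    """Fork a child that should report its session id over a pipe.

    Returns (bytes read from the pipe, terminating signal or None).
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: must never return into the caller's code.
        try:
            os.close(read_fd)
            if crash:
                # Stand-in for the real crash: die from SIGSEGV before
                # anything has been written to the pipe.
                resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
                signal.signal(signal.SIGSEGV, signal.SIG_DFL)
                os.kill(os.getpid(), signal.SIGSEGV)
            os.setsid()
            os.write(write_fd, struct.pack("q", os.getsid(0)))
        finally:
            os._exit(0)
    # Parent: close our copy of the write end so EOF can be seen.
    os.close(write_fd)
    data = os.read(read_fd, 8)   # b"" means the child wrote nothing
    os.close(read_fd)
    _, status = os.waitpid(pid, 0)
    sig = os.WTERMSIG(status) if os.WIFSIGNALED(status) else None
    return data, sig
```

With crash=True the parent reads zero bytes and the wait status reports
SIGSEGV (signal 11 on Linux), which is consistent with the mom log lines above.
With crash=False it reads the full 8 bytes and there is no terminating signal.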

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California