[torqueusers] segfault loop
David Jackson
jacksond at clusterresources.com
Thu Apr 21 10:29:59 MDT 2005
Garrick,
Were you able to reproduce this or track this down?
Dave
On Tue, 2005-04-12 at 16:52 -0700, Garrick Staples wrote:
> I've got a new bug here. When starting a job, the child process that is
> about to run prologue segfaults. pbs_server continues to start the job and the
> prologue child continues to segfault. Eventually the job gets assigned to
> another node and starts up fine.
>
> So the loop is: pbs_server sends the job to MS (the Mother Superior pbs_mom),
> transfers the files, and tells MS to start the job. MS forks to run prologue.
> The child segfaults. MS sends the failure back to pbs_server. pbs_server tells
> MS to delete the job, and the cycle starts over from the beginning.
>
> This happens to several nodes at once. A quick pbs_mom restart fixes the
> problem. Another job, or a job from another user, can start fine. So there is
> something about the initial job prep in MS that is failing and is cached across
> job starts... and it must be triggered from a shared resource, like DNS or NFS.
>
>
> looping mom logs...
> 04/12/2005 15:57:52;0001; pbs_mom;Job;225877.hpc-pbs.usc.edu;phase 2 of job launch successfully completed
> 04/12/2005 15:57:52;0001; pbs_mom;Job;225877.hpc-pbs.usc.edu;job not ready after 0 second timeout, MOM will recheck
> 04/12/2005 15:57:52;0008; pbs_mom;Job;scan_for_terminated;pid 1573 not tracked, exitcode=11
> 04/12/2005 15:57:52;0001; pbs_mom;Svr;pbs_mom;No child processes (10) in TMomFinalizeJob3, read of pipe for sid failed for job 225877.hpc-pbs.usc.edu (0 of 8 bytes)
> 04/12/2005 15:57:52;0001; pbs_mom;Job;TMomFinalizeJob3;start failed, improper sid
> 04/12/2005 15:57:52;0001; pbs_mom;Job;225877.hpc-pbs.usc.edu;ALERT: job failed phase 3 start, server will retry
>
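For anyone reading along, the log lines above can be picked apart with a few lines of Python. This is only a rough sketch: the field names (stamp, event, daemon, class, id, message) are my own labels for the six semicolon-separated fields, and treating the logged exitcode as a raw wait() status is an assumption, not something the logs confirm. Under that assumption, a value of 11 decodes to "killed by signal 11", which is SIGSEGV on Linux.

```python
import os
import signal


def parse_mom_log_line(line):
    """Split one pbs_mom log line into six semicolon-separated fields.

    The dictionary keys are illustrative labels, not official TORQUE names.
    """
    # Drop a leading mail-quote marker ("> ") and the trailing newline.
    line = line.lstrip("> ").rstrip("\n")
    # Limit the split so semicolons inside the message text are preserved.
    stamp, event, daemon, obj_class, obj_id, message = line.split(";", 5)
    return {
        "stamp": stamp.strip(),
        "event": event.strip(),
        "daemon": daemon.strip(),
        "class": obj_class.strip(),
        "id": obj_id.strip(),
        "message": message.strip(),
    }


def describe_wait_status(status):
    """Describe a raw wait() status (ASSUMPTION: the logged value is one)."""
    if os.WIFSIGNALED(status):
        return "killed by " + signal.Signals(os.WTERMSIG(status)).name
    if os.WIFEXITED(status):
        return "exited with code %d" % os.WEXITSTATUS(status)
    return "unknown status %d" % status


if __name__ == "__main__":
    sample = ("> 04/12/2005 15:57:52;0008;   pbs_mom;Job;scan_for_terminated;"
              "pid 1573 not tracked, exitcode=11")
    record = parse_mom_log_line(sample)
    print(record["id"], "->", record["message"])
    print(describe_wait_status(11))
```

If the logged value is instead an already-decoded exit code, the second helper does not apply, so treat its result as a hint rather than proof.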
> strace output...
> 24044 close(11) = 0
> 24044 munlockall() = 0
> 24044 close(16) = 0
> 24044 close(15) = 0
> 24044 --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>
>
> The munlockall() call is in fork_me() being called from TMomFinalizeJob2().
> The last 2 close() calls must be in TMomFinalizeChild(). Perhaps it is
> segfaulting inside of set_shell()?
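To make the failure pattern concrete, here is a minimal POSIX-only sketch of what the parent would see if the forked child dies before reporting back over a pipe. It is not the pbs_mom code: launch_child() is a hypothetical helper, the child's pid stands in for the session id, and the 8-byte payload is chosen only to mirror the "0 of 8 bytes" message in the log. The child raises SIGSEGV on itself to imitate the crash.

```python
import os
import signal
import struct


def launch_child(child_crashes):
    """Fork a child that should report an 8-byte id over a pipe.

    Returns (bytes_read, wait_status) as seen by the parent.
    Hypothetical stand-in for the MOM's job-start handshake.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: never return into the caller's stack.
        try:
            os.close(read_fd)
            if child_crashes:
                # Imitate the crash before anything is written to the pipe.
                signal.signal(signal.SIGSEGV, signal.SIG_DFL)
                os.kill(os.getpid(), signal.SIGSEGV)
            # Healthy path: send 8 bytes (the pid stands in for the sid).
            os.write(write_fd, struct.pack("q", os.getpid()))
            os._exit(0)
        finally:
            os._exit(1)
    # Parent: close the write end so read() sees EOF if the child dies.
    os.close(write_fd)
    data = os.read(read_fd, 8)
    os.close(read_fd)
    _, status = os.waitpid(pid, 0)
    return len(data), status


if __name__ == "__main__":
    for crashes in (True, False):
        nbytes, status = launch_child(crashes)
        if os.WIFSIGNALED(status):
            outcome = "signal %d" % os.WTERMSIG(status)
        else:
            outcome = "exit %d" % os.WEXITSTATUS(status)
        print("crash=%s: read %d of 8 bytes, child ended with %s"
              % (crashes, nbytes, outcome))
```

In the crashing case the parent reads zero bytes because the only writer has gone away, which is at least consistent with the "read of pipe for sid failed" line. It says nothing about where inside the child the real segfault happens.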
>