[torquedev] MOM killing itself?

Garrick Staples garrick at usc.edu
Mon Jan 9 15:54:49 MST 2006


Has anyone seen this happen?  MOMs on sister nodes are killing themselves
as they send SIGTERMs to job tasks as part of job exit.  I only see it with
very short jobs.  There must be a race condition on job start where new
job processes haven't been assigned a new session id yet.

01/08/2006 11:48:14;0008;   pbs_mom;Job;616015.hpc-pbs.usc.edu;JOIN JOB as node 12
01/08/2006 11:48:20;0100;   pbs_mom;Job;616015.hpc-pbs.usc.edu;kill_job received
01/08/2006 11:48:20;0080;   pbs_mom;Job;616015.hpc-pbs.usc.edu;removing transient job directory /tmp/616015.hpc-pbs.usc.edu
01/08/2006 11:48:20;0002;   pbs_mom;Svr;pbs_mom;caught signal 15: leaving jobs running, just exiting
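
To illustrate the suspected window (a sketch of the hypothesis only, not
code from pbs_mom): on POSIX systems a forked child stays in its parent's
session until it calls setsid().  The helper below,
child_shares_session_before_setsid(), is a hypothetical name; it holds a
child in that window with a pipe and compares session ids from the parent.

```c
#define _XOPEN_SOURCE 700
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, not part of TORQUE.
 * Returns 1 if a freshly forked child still reports the parent's
 * session id before it has called setsid(), 0 if not, -1 on error. */
int child_shares_session_before_setsid(void)
{
    int gate[2];
    pid_t child, child_sid, my_sid;
    char c;

    if (pipe(gate) == -1)
        return -1;

    child = fork();
    if (child == -1) {
        close(gate[0]);
        close(gate[1]);
        return -1;
    }

    if (child == 0) {
        /* Child: block until the parent closes its end of the pipe,
         * i.e. until the parent has sampled our session id. */
        close(gate[1]);
        if (read(gate[0], &c, 1) < 0)
            _exit(1);
        setsid();               /* only now start a new session */
        _exit(0);
    }

    /* Parent: sample both session ids while the child is still waiting. */
    close(gate[0]);
    my_sid = getsid(0);
    child_sid = getsid(child);
    close(gate[1]);             /* EOF on the pipe releases the child */
    waitpid(child, NULL, 0);

    if (my_sid == (pid_t)-1 || child_sid == (pid_t)-1)
        return -1;
    return child_sid == my_sid;
}
```

If a task's session id were recorded during that window, it would equal
MOM's own session id, which would be consistent with the log above.
Whether pbs_mom really samples it that early is the open question.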

kill_job() calls kill_task() for each task in the job.

kill_task() checks every process on the system, killing all processes
with the task's session id.

Obviously it would be trivial to add a check for MOM's pid, but I'd like to
understand the nature of the race and whether anyone else has seen this.
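
For reference, the pid check could look roughly like the sketch below.
This is not the actual kill_task() code; struct proc_entry and
select_kill_targets() are hypothetical names, and the process table is
passed in as an array rather than read from the system, so that only the
selection logic is shown.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical snapshot of one process: its pid and its session id. */
struct proc_entry {
    pid_t pid;
    pid_t sid;
};

/* Hypothetical helper, not part of TORQUE.
 * Copies into out[] the pids of all processes whose session id matches
 * task_sid, skipping self_pid so the caller never signals itself.
 * Returns the number of pids written (at most out_cap). */
size_t select_kill_targets(const struct proc_entry *procs, size_t nprocs,
                           pid_t task_sid, pid_t self_pid,
                           pid_t *out, size_t out_cap)
{
    size_t i, n = 0;

    for (i = 0; i < nprocs && n < out_cap; i++) {
        if (procs[i].sid != task_sid)
            continue;           /* not part of this task's session */
        if (procs[i].pid == self_pid)
            continue;           /* the guard: never target ourselves */
        out[n++] = procs[i].pid;
    }
    return n;
}
```

With a table of (pid 100, sid 100), (pid 101, sid 100) and (pid 200, sid
200), a task_sid of 100 and a self_pid of 100, only pid 101 is selected.
The guard only hides the symptom, though: a task whose recorded session id
matches MOM's would still take out anything else in MOM's session.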

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California