[torquedev] MOM killing itself?
Garrick Staples
garrick at usc.edu
Mon Jan 9 15:54:49 MST 2006
Has anyone seen this happen? MOMs on sister nodes are killing themselves
as they send SIGTERMs to job tasks as part of job exit. I only see it with
very short jobs. There must be a race condition on job start where new
job processes haven't been assigned a new session id yet.
01/08/2006 11:48:14;0008; pbs_mom;Job;616015.hpc-pbs.usc.edu;JOIN JOB as node 12
01/08/2006 11:48:20;0100; pbs_mom;Job;616015.hpc-pbs.usc.edu;kill_job received
01/08/2006 11:48:20;0080; pbs_mom;Job;616015.hpc-pbs.usc.edu;removing transient job directory /tmp/616015.hpc-pbs.usc.edu
01/08/2006 11:48:20;0002; pbs_mom;Svr;pbs_mom;caught signal 15: leaving jobs running, just exiting
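To illustrate the window I suspect, here is a minimal standalone sketch (not TORQUE code). It relies only on standard POSIX behavior: a forked child stays in its parent's session until it calls setsid(). The function name sid_inherited_until_setsid() is made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of TORQUE.
 * Forks a child and checks that the child reports the parent's session id
 * before calling setsid(), and a different session id afterwards.
 * Returns 1 if both hold, 0 if not, -1 on error.
 */
int sid_inherited_until_setsid(void)
{
    pid_t parent_sid = getsid(0);
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: still in the parent's session at this point. */
        int before_same = (getsid(0) == parent_sid);

        if (setsid() < 0)
            _exit(2);

        /* Child: now the leader of its own new session. */
        int after_differs = (getsid(0) != parent_sid);

        _exit((before_same && after_differs) ? 0 : 1);
    }

    /* Parent: collect the child's verdict from its exit status. */
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;

    if (WEXITSTATUS(status) == 0)
        return 1;
    return (WEXITSTATUS(status) == 1) ? 0 : -1;
}
```

If the session id recorded for a task were sampled inside that window, it would be the parent's session id. Whether that is what actually happens inside MOM is exactly the part I have not confirmed.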
kill_job() calls kill_task() for each task in the job.
kill_task() checks every process on the system, killing all processes
with the task's session id.
Obviously it would be trivial to add a check for MOM's pid, but I'd like to
understand the nature of the race and whether anyone else has seen this.
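
The pid check could look roughly like the sketch below. This is not the actual kill_task() source; struct proc_entry, signal_session() and the dry_run flag are invented for illustration, and the process table is passed in as an array rather than read from the system.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <signal.h>
#include <sys/types.h>

/* Hypothetical process table entry: a pid and the session it belongs to. */
struct proc_entry {
    pid_t pid;
    pid_t sid;
};

/* A process is a target if it is in the task's session and is not us. */
int should_signal(const struct proc_entry *p, pid_t task_sid, pid_t self_pid)
{
    return p->sid == task_sid && p->pid != self_pid;
}

/*
 * Walks the table and signals every process in task_sid except self_pid.
 * With dry_run set, nothing is signaled and matches are only counted.
 * Returns the number of processes signaled (or that would be signaled).
 */
int signal_session(const struct proc_entry *procs, size_t n,
                   pid_t task_sid, pid_t self_pid, int sig, int dry_run)
{
    int count = 0;

    for (size_t i = 0; i < n; i++) {
        if (!should_signal(&procs[i], task_sid, self_pid))
            continue;
        if (dry_run || kill(procs[i].pid, sig) == 0)
            count++;
    }
    return count;
}
```

A caller would pass getpid() as self_pid. The guard only hides the symptom, though: if the recorded session id is MOM's own, other processes in MOM's session would still match.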
--
Garrick Staples, Linux/HPCC Administrator
University of Southern California