[torqueusers] [Fwd: Fwd: MPI QDel Problem (RT 6690)]

Garrick Staples garrick at usc.edu
Tue Jan 12 10:42:29 MST 2010


Epilogue should be written to clean up processes, no?
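
Something like this rough sketch, say (Python rather than shell; it
assumes the usual epilogue argument order with the user name in argv[2],
and naively assumes the user has nothing else running on the node):

#!/usr/bin/env python
# Rough epilogue sketch: kill anything still owned by the job's user.
# Assumes TORQUE passes the job id in argv[1] and the user name in
# argv[2]; adjust to your epilogue argument order.  This is too blunt
# if the user can have a second job or login session on the node.
import os
import pwd
import signal
import sys

def pids_owned_by(uid):
    """Yield pids of live processes owned by uid, scanned from /proc."""
    for entry in os.listdir('/proc'):
        if not entry.isdigit():
            continue
        try:
            if os.stat('/proc/' + entry).st_uid == uid:
                yield int(entry)
        except OSError:
            continue  # process exited while we were scanning

if __name__ == '__main__':
    uid = pwd.getpwnam(sys.argv[2]).pw_uid
    for pid in pids_owned_by(uid):
        if pid == os.getpid():
            continue  # don't shoot ourselves
        try:
            os.kill(pid, signal.SIGKILL)
        except OSError:
            pass  # already gone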

On Tue, Jan 12, 2010 at 09:56:01AM -0700, Ken Nielson alleged:
> 

> Date: Tue, 12 Jan 2010 09:53:47 -0700 (MST)
> From: David Beer <dbeer at adaptivecomputing.com>
> Subject: Fwd: MPI QDel Problem (RT 6690)
> To: Ken Nielson <knielson at adaptivecomputing.com>
> 
> 
> ----- Forwarded Message -----
> From: "David Beer" <dbeer at adaptivecomputing.com>
> To: "torqueusers" <torqueusers at supercluster.org>
> Sent: Tuesday, January 12, 2010 9:51:26 AM
> Subject: MPI QDel Problem (RT 6690)
> 
> I'm wondering if any of you have already run into this problem with MPI jobs.  If someone has experience with it, I would greatly appreciate the help.  I am looking into the kill_delay variable, but I am curious whether anyone has found another workaround.
> 
> Thanks,
> 
> David Beer
> 
> ----quoted text below----
> 
> We are seeing the following: if a user kills his ParaStation MPI job via
> qdel, apparently the following happens:
> 
> 1. The application gets a SIGTERM.
> 2. The ParaStation MPI shepherd (psid) starts cleaning up all the
> processes started via mpiexec; this can take a minute or two.
> 3. During this cleanup, the MPI shepherd gets a SIGKILL from PBS before
> the processes under its control are removed, so it cannot tidy up
> properly.
> 4. Orphaned MPI processes are left on the nodes.
> 5. PBS considers the nodes free again; however, ParaStation still sees
> the orphaned jobs and says "no good" to the next MPI job, which
> consequently crashes for lack of resources.
> 
> As a workaround, we've incorporated a check for orphaned processes into
> the prologue and epilogue scripts, so we can set the affected nodes
> offline to prevent further job crashes.
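>
> The idea is roughly the sketch below (simplified; SYSTEM_UID_MAX and
> the prologue exit handling are site-specific assumptions, and a real
> check would have to tolerate users with other legitimate jobs on the
> node):
>
> #!/usr/bin/env python
> # Simplified prologue sketch: if non-system processes are already
> # running before the job starts, treat them as orphans from an earlier
> # job, mark this node offline, and refuse to start the job.
> import os
> import socket
> import subprocess
> import sys
>
> SYSTEM_UID_MAX = 999  # assumption: uids above this belong to real users
>
> def orphan_pids():
>     pids = []
>     for entry in os.listdir('/proc'):
>         if not entry.isdigit():
>             continue
>         try:
>             if os.stat('/proc/' + entry).st_uid > SYSTEM_UID_MAX:
>                 pids.append(int(entry))
>         except OSError:
>             continue  # process exited while scanning
>     return pids
>
> if __name__ == '__main__':
>     if orphan_pids():
>         # stop the scheduler from placing further jobs on this node
>         subprocess.call(['pbsnodes', '-o', socket.gethostname()])
>         sys.exit(1)  # non-zero exit keeps the job from running here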
> 
> We then tried to use the kill_delay variable with a value of 120
> seconds to give the MPI shepherd (psid) ample time to do its cleanup.
> This doesn't appear to work, though, as my colleague reports:
> 
> >Obviously kill_delay does not work as expected. Again, 28 nodes were
> >set offline due to left-over processes in state D, which disappeared
> >soon afterwards.
> 
> >Looking into the mother superior's log shows:
> 
> >01/05/2010 13:41:44;0008; pbs_mom;Job;113708.jj28b01;Job Modified at request of PBS_Server at jj28b01
> >01/05/2010 13:42:23;0001; pbs_mom;Job;TMomFinalizeJob3;job 113708.jj28b01 started, pid = 5580
> >01/05/2010 14:31:50;0008; pbs_mom;Job;113708.jj28b01;kill_task: killing pid 5580 task 1 with sig 15
> >01/05/2010 14:31:50;0008; pbs_mom;Job;113708.jj28b01;kill_task: killing pid 6019 task 1 with sig 15
> >01/05/2010 14:31:50;0008; pbs_mom;Job;113708.jj28b01;kill_task: killing pid 6084 task 1 with sig 15
> >01/05/2010 14:31:50;0008; pbs_mom;Job;113708.jj28b01;kill_task: killing pid 6088 task 1 with sig 15
> >01/05/2010 14:31:50;0008; pbs_mom;Job;113708.jj28b01;kill_task: killing pid 6088 task 1 gracefully with sig 15
> >01/05/2010 14:31:55;0008; pbs_mom;Job;113708.jj28b01;kill_task: killing pid 6088 task 1 with sig 9
> >01/05/2010 14:31:55;0080; pbs_mom;Job;113708.jj28b01;scan_for_terminated: job 113708.jj28b01 task 1 terminated, sid=5580
> >01/05/2010 14:31:55;0008; pbs_mom;Job;113708.jj28b01;job was terminated
> >01/05/2010 14:32:06;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> >01/05/2010 14:32:06;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> >01/05/2010 14:32:06;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> >01/05/2010 14:32:16;0008; pbs_mom;Job;113708.jj28b01;checking job post-processing routine
> >01/05/2010 14:32:16;0080; pbs_mom;Job;113708.jj28b01;obit sent to server
> >
> >
> >I.e., the delay between sending signal 15 and signal 9 to pid 6088 is
> >5 seconds, not 240 as expected from Torque's configuration for all the
> >queues. Job 113708 was running in queue hpcff, which has
> >kill_delay=240, too.
> >
> >To me it's unclear how terminating a job really works. Which instance
> >is responsible for sending the SIGKILL?
> 
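> For reference, what I would expect kill_delay to do is roughly the
> sketch below (a conceptual illustration only, not TORQUE's actual
> code; KILL_DELAY stands in for the queue/server attribute):
>
> import os
> import signal
> import time
>
> KILL_DELAY = 120  # seconds; stand-in for the kill_delay attribute
>
> def terminate_job_session(pgid):
>     """SIGTERM the job's process group, give it kill_delay seconds to
>     clean up, and only escalate to SIGKILL when the grace period ends."""
>     os.killpg(pgid, signal.SIGTERM)
>     deadline = time.time() + KILL_DELAY
>     while time.time() < deadline:
>         try:
>             os.killpg(pgid, 0)  # signal 0 just probes for existence
>         except OSError:
>             return              # every process exited on its own
>         time.sleep(1)
>     os.killpg(pgid, signal.SIGKILL)  # grace period over: force it
>
> In the log above the escalation happens after only 5 seconds, so
> whatever sends the SIGKILL is apparently not honoring this grace
> period.
>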
> -- 
> David Beer | Senior Software Engineer
> Adaptive Computing

> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers


-- 
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California

Life is Good!