[torqueusers] Removing processes after a job is killed

Joshua Bernstein jbernstein at penguincomputing.com
Wed Jul 2 12:12:34 MDT 2008


Prakash Velayutham wrote:
> Hi,
> 
> I am sure you must have heard of mpiexec for Torque-based Task 
> Management. mpiexec (available from www.osc.edu/~pw/mpiexec/index.php) 
> basically does the cleanup for you when you do qdel or something like that.

Absolutely,

	You should definitely be using mpiexec when running an MPICH or MVAPICH 
job under TORQUE. mpiexec uses the tm interface to spawn processes on 
remote nodes, rather than using something like SSH. The benefit of using 
the tm interface is twofold. First, the issue you describe will go away 
because the sister moms will know which processes to kill when 
a qdel or other kill signal is received. Second, utilization rates will 
be tracked across ALL of the nodes in the job rather than just on the 
mom superior. This is especially important when using something like 
Moab or another statistics package to track utilization or do chargeback. 
Without mpiexec, users will be getting away with cycles they haven't paid for.
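
As a rough illustration, a job script along the following lines would 
launch through mpiexec instead of an SSH-based launcher. This is only a 
sketch under some assumptions: the OSC mpiexec binary is first in the 
PATH on the compute nodes, ./my_mpi_app is a hypothetical MPI program, 
and the resource request and job name are placeholders to adapt to your 
site. Outside of a TORQUE job it only prints the command it would run.

```shell
#!/bin/bash
#PBS -N mpi_example
#PBS -l nodes=2:ppn=4

# Hypothetical MPI binary; replace with your own program.
app="./my_mpi_app"

# TORQUE sets PBS_O_WORKDIR to the directory qsub was run from.
# Fall back to the current directory so the sketch can be dry-run.
workdir="${PBS_O_WORKDIR:-$PWD}"
cd "$workdir"

if [ -n "${PBS_JOBID:-}" ]; then
    # Inside a job: mpiexec asks the moms (via tm) to start the ranks,
    # so every process belongs to the job and is killed on qdel.
    mpiexec "$app"
else
    # Not under TORQUE: show what would be launched.
    echo "dry run: mpiexec $app"
fi
```

Submit it with qsub as usual. No machine file or process count is 
passed here, on the assumption that mpiexec picks up the node 
allocation from TORQUE itself.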

-Joshua Bernstein
Software Engineer
Penguin Computing



> Prakash
> 
> On Jul 2, 2008, at 1:42 PM, Craig Macdonald wrote:
> 
>> Firstly, limit each node to one job per user. Then you can use a kill 
>> in the epilogue. See below for a cut-down example.
>>
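>> #!/bin/bash
>> # TORQUE passes the job id and the job owner's user name as the
>> # first two epilogue arguments.
>> jobid=$1
>> userid=$2
>>
>> # Kill every process the job owner still has on this node.
>> ps -U "$userid" -o pid --no-heading | xargs -r kill -KILL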
>>
>>
>> C
>>
>>
>>
>> David Sheen wrote:
>>> The parallel programming environments we use (e.g. MPICH) use SSH to
>>> create processes on the sister nodes.  If these jobs fail (are
>>> deleted, the mother node crashes, etc), the spawned processes remain
>>> on the sisters and eventually someone has to go and clean them out.
>>> Is there any way to use epilogue scripts to keep track of these
>>> processes and make sure they get killed properly if they need to be?
>>>
>>> David
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>
>>
> 
> Prakash Velayutham
> Programmer / Analyst
> Cincinnati Children's Hospital Medical Center
> 

