[torqueusers] Removing processes after a job is killed
Joshua Bernstein
jbernstein at penguincomputing.com
Wed Jul 2 12:12:34 MDT 2008
Prakash Velayutham wrote:
> Hi,
>
> I am sure you must have heard of mpiexec for Torque-based Task
> Management. mpiexec (available from www.osc.edu/~pw/mpiexec/index.php)
> basically does the cleanup for you when you do qdel or something like that.
Absolutely,
You should definitely be using mpiexec when running an MPICH or MVAPICH
job under TORQUE. mpiexec uses the tm interface to spawn processes on
remote nodes, rather than using something like SSH. The benefit of using
the tm interface is twofold. First, the issue you describe goes away,
because the sister MOMs will know which processes to kill when a qdel
or other kill signal is received. Second, utilization rates will be
tracked across ALL of the nodes in the job rather than just on the
mother superior. This is especially important when using Moab or another
statistics package to track utilization or do chargeback. Without
mpiexec, users will be getting away with cycles they haven't paid for.
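A minimal job-script sketch of what this looks like, assuming OSC's mpiexec is installed on the compute nodes and the program is built against MPICH or MVAPICH. The job name, resource request and `./my_mpi_program` are placeholders, not values from this thread. Note there is no machinefile and no ssh: mpiexec reads the allocated nodes from TORQUE and starts the ranks through tm.

```shell
#!/bin/bash
#PBS -N mpi_example
#PBS -l nodes=2:ppn=4
#PBS -l walltime=00:10:00

# Start in the directory the job was submitted from
cd "$PBS_O_WORKDIR" || exit 1

# OSC mpiexec takes the node list from TORQUE and launches every rank
# through the tm interface, so each sister MOM knows about the processes
# it started. ./my_mpi_program is a placeholder for your own binary.
mpiexec ./my_mpi_program
```

Submitted with `qsub`, a later `qdel` lets each MOM clean up the ranks it spawned, instead of leaving ssh-launched processes behind on the sisters.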
-Joshua Bernstein
Software Engineer
Penguin Computing
> Prakash
>
> On Jul 2, 2008, at 1:42 PM, Craig Macdonald wrote:
>
>> Firstly, limit each node to one job per user. Then you can use a kill
>> in the epilogue. See below for a cut-down example:
>>
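>> #!/bin/bash
>> # TORQUE passes the job id as $1 and the job owner's user name as $2
>> jobid=$1
>> userid=$2
>>
>> # Kill every process still owned by the job's user on this node
>> ps -U "$userid" -o pid --no-heading | xargs -r kill -KILL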
>>
>>
>> C
>>
>>
>>
>> David Sheen wrote:
>>> The parallel programming environments we use (e.g. MPICH) use SSH to
>>> create processes on the sister nodes. If these jobs fail (are
>>> deleted, the mother node crashes, etc.), the spawned processes remain
>>> on the sisters and eventually someone has to go and clean them out.
>>> Is there any way to use epilogue scripts to keep track of these
>>> processes and make sure they get killed properly if they need to be?
>>>
>>> David
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>
>>
>
> Prakash Velayutham
> Programmer / Analyst
> Cincinnati Children's Hospital Medical Center
>
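Craig's epilogue in the quoted thread kills everything owned by the job's user. Below is a minimal, more defensive sketch under the same one-job-per-user assumption and the same TORQUE epilogue arguments ($1 job id, $2 user name). It refuses to act on root, system accounts or unknown users. `MIN_UID=1000` is an assumed site convention, and `is_regular_user`, `kill_user_procs` and `DRY_RUN` are hypothetical names chosen for this example. Like the original, it only cleans up the node on which the epilogue actually runs.

```shell
#!/bin/bash
# Hypothetical hardened variant of the epilogue in this thread.
# Assumes TORQUE passes the job id as $1 and the job owner as $2,
# and that each node runs at most one job per user.

MIN_UID=1000   # assumption: lowest UID given to ordinary users at this site

# Succeed only if $1 is an existing account with UID >= MIN_UID,
# so root and system daemons are never targeted.
is_regular_user() {
    local uid
    uid=$(id -u "$1" 2>/dev/null) || return 1   # unknown user: refuse
    [ "$uid" -ge "$MIN_UID" ]
}

# Kill every process owned by user $1. With DRY_RUN=1, print the PIDs instead.
kill_user_procs() {
    local user=$1 pids
    if ! is_regular_user "$user"; then
        echo "epilogue: refusing to kill processes of '$user'" >&2
        return 1
    fi
    pids=$(ps -U "$user" -o pid=)
    [ -z "$pids" ] && return 0           # nothing left to clean up
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo $pids                       # unquoted on purpose: PIDs on one line
    else
        kill -KILL $pids 2>/dev/null     # some processes may already be gone
    fi
    return 0
}

# Act only when executed directly with epilogue arguments (jobid user ...)
if [ "${BASH_SOURCE[0]}" = "$0" ] && [ $# -ge 2 ]; then
    kill_user_procs "$2"
    exit 0   # report success even if cleanup was refused
fi
```

Running it by hand as `DRY_RUN=1 ./epilogue <jobid> <user>` prints the PIDs it would kill, which is a cheap way to check the policy before installing it.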