[torqueusers] [Fwd: Fwd: MPI QDel Problem (RT 6690)]

Jerry Smith jdsmit at sandia.gov
Tue Jan 12 10:49:12 MST 2010


Garrick Staples wrote:
> Epilogue should be written to clean up processes, no?
>
> On Tue, Jan 12, 2010 at 09:56:01AM -0700, Ken Nielson alleged:
+1

Don't mark your nodes offline from the prologue/epilogue; do the process 
cleanup there, and only mark the node offline if that cleanup fails.

From our epilogue.parallel (with a shout-out to Garrick):

# A more thorough cleanproc (original concept from Garrick Staples).
# PBS_USER is the job owner and hostname is this node's name; both are
# set earlier in the script.
if [ -n "$(pgrep -d, -U "$PBS_USER")" ]; then
    # Ask politely first: SIGTERM everything owned by the job user.
    pkill -U "$PBS_USER"
    try=0
    count=$(pgrep -U "$PBS_USER" | wc -l)
    # Escalate to SIGKILL, up to 5 attempts, 2 seconds apart.
    while [ "$count" -gt 0 -a "$try" -lt 5 ]; do
        sleep 2
        pkill -9 -U "$PBS_USER"
        let try++
        count=$(pgrep -U "$PBS_USER" | wc -l)
    done

    if [ "$count" -gt 0 ]; then
        msg="Epilogue: $count $PBS_USER PIDs not killed"
        # put it in syslog that this node needs attention
        echo "$msg" | logger -t takedownnode
        # attempt to mark the node offline
        /apps/torque/bin/pbsnodes -o -N "$msg" "$hostname"
    fi
fi
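
For completeness, PBS_USER and hostname get set near the top of the script. A minimal sketch of that setup, assuming Torque's usual prologue/epilogue argument order ($1 job id, $2 user name):

# assumed setup earlier in epilogue.parallel
jobid=$1                  # job id passed in by pbs_mom
PBS_USER=$2               # job owner's user name
hostname=$(hostname -s)   # short node name; adjust if pbs_server knows nodes by FQDN

Once a node has been cleaned up by hand, "pbsnodes -c <nodename>" clears the offline flag again.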

--Jerry


>> Date: Tue, 12 Jan 2010 09:53:47 -0700 (MST)
>> From: David Beer <dbeer at adaptivecomputing.com>
>> Subject: Fwd: MPI QDel Problem (RT 6690)
>> To: Ken Nielson <knielson at adaptivecomputing.com>
>>
>>
>> ----- Forwarded Message -----
>> From: "David Beer" <dbeer at adaptivecomputing.com>
>> To: "torqueusers" <torqueusers at supercluster.org>
>> Sent: Tuesday, January 12, 2010 9:51:26 AM
>> Subject: MPI QDel Problem (RT 6690)
>>
>> I'm wondering if any of you have already experienced this problem when using MPI jobs.  If you have, I would greatly appreciate your input.  I am looking into the kill_delay variable, but I am curious whether anyone has another workaround.
>>
>> Thanks,
>>
>> David Beer
>>
>> ----quoted text below----
>>
>> We experience the following: if a user kills his ParaStation MPI job via
>> qdel, apparently the following happens:
>>
>> 1. The application gets a SIGTERM.
>> 2. The ParaStation MPI shepherd (psid) starts cleaning up all the
>> processes started via mpiexec; this might take a minute or two.
>> 3. During this, the MPI shepherd gets a SIGKILL from PBS before the
>> processes under its control are removed, so it cannot tidy up properly.
>> 4. Orphaned MPI processes are left on the nodes.
>> 5. PBS considers the nodes free again, but ParaStation still sees the
>> orphaned processes and says "no good" to the next MPI job, which
>> consequently crashes for lack of resources.
>>
>> As a workaround, we've incorporated a check for orphaned processes in
>> the prologue and epilogue scripts, so we can set the affected nodes
>> offline to prevent further job crashes.
>>
>> We've then tried to use the kill_delay variable with a value of 120
>> seconds to give the MPI shepherd (psid) ample time to do the cleaning
>> up. This doesn't appear to work, though, as my colleague reports:
>>
>>     
>>> Obviously kill_delay does not work as expected. Again, 28 nodes were
>>> set offline because of left-over processes in state D that disappeared
>>> soon afterwards.
>>>       
>>> Looking into the mother superior's log shows:
>>>       
>>> 01/05/2010 13:41:44;0008; pbs_mom;Job;113708.jj28b01;Job Modified at
>>> request of PBS_Server at jj28b01
>>> 01/05/2010 13:42:23;0001; pbs_mom;Job;TMomFinalizeJob3;job
>>> 113708.jj28b01 started, pid = 5580
>>> 01/05/2010 14:31:50;0008; pbs_mom;Job;113708.jj28b01;kill_task:
>>> killing pid 5580 task 1 with sig 15
>>> 01/05/2010 14:31:50;0008; pbs_mom;Job;113708.jj28b01;kill_task:
>>> killing pid 6019 task 1 with sig 15
>>> 01/05/2010 14:31:50;0008; pbs_mom;Job;113708.jj28b01;kill_task:
>>> killing pid 6084 task 1 with sig 15
>>> 01/05/2010 14:31:50;0008; pbs_mom;Job;113708.jj28b01;kill_task:
>>> killing pid 6088 task 1 with sig 15
>>> 01/05/2010 14:31:50;0008; pbs_mom;Job;113708.jj28b01;kill_task:
>>> killing pid 6088 task 1 gracefully with sig 15
>>> 01/05/2010 14:31:55;0008; pbs_mom;Job;113708.jj28b01;kill_task:
>>> killing pid 6088 task 1 with sig 9
>>> 01/05/2010 14:31:55;0080;
>>> pbs_mom;Job;113708.jj28b01;scan_for_terminated: job 113708.jj28b01
>>> task 1 terminated, sid=5580
>>> 01/05/2010 14:31:55;0008; pbs_mom;Job;113708.jj28b01;job was terminated
>>> 01/05/2010 14:32:06;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
>>> 01/05/2010 14:32:06;0080;
>>> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked,
>>> top of while loop
>>> 01/05/2010 14:32:06;0080; pbs_mom;Svr;preobit_reply;in while loop,
>>> no error from job stat
>>> 01/05/2010 14:32:16;0008; pbs_mom;Job;113708.jj28b01;checking job
>>> post-processing routine
>>> 01/05/2010 14:32:16;0080; pbs_mom;Job;113708.jj28b01;obit sent to server
>>>
>>>
>>> I.e. the delay between sending signal 15 and signal 9 to pid 6088 is 5
>>> seconds, not 240 as expected from Torque's configuration for all the
>>> queues. Job 113708 was running in queue hpcff, which has
>>> kill_delay=240, too.
>>>
>>> To me it's unclear how terminating a job really works. Which instance
>>> is responsible for sending the SIGKILL?
>>>       
>> -- 
>> David Beer | Senior Software Engineer
>> Adaptive Computing
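
For reference on the kill_delay discussion above: the attribute can be set per queue or server-wide with qmgr, and qdel can also be told to wait between SIGTERM and SIGKILL for a single job. A minimal sketch, using the queue name and value quoted in the thread:

# give psid time to clean up before pbs_mom escalates to SIGKILL
qmgr -c "set queue hpcff kill_delay = 240"
qmgr -c "set server kill_delay = 240"

# per-job alternative: have qdel itself wait 240 seconds before SIGKILL
qdel -W 240 113708.jj28b01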

