[torqueusers] Re: Grace time for mjobctl -R

Jan Ploski Jan.Ploski at offis.de
Fri Nov 9 01:52:56 MST 2007


mauiusers-bounces at supercluster.org wrote on 11/09/2007 12:55:36 AM:

> On Thu, Nov 08, 2007 at 07:08:48PM +0100, Jan Ploski alleged:
> > Hello,
> > 
> > Is there a way to set a "grace time" for a job which is requeued with 
> > mjobctl -R? I'd like mjobctl -R to send the job a TERM signal which 
> > triggers the job to create a checkpoint. If the job does not terminate 
> > for 5 minutes after that, I'd like it to be sent KILL. Right now, the 
> > job is force-killed within just a few seconds.
> 
> Using torque?  See the kill_delay queue/server attribute.

Thanks. Yes, I'm using TORQUE 2.1.6. The attribute does not seem to help, 
however: I set kill_delay = 300 on both the server and the queue and then 
restarted pbs_server. But when I qdel the following test job, it is gone 
within a few seconds. Stdout shows that the TERM signal was delivered 
twice (?), five seconds apart. When I run jpl1.jb directly from the shell, 
it catches TERM and INT signals correctly...
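
For context, setting the attribute on both levels would typically look like the sketch below. The exact commands used are not shown in this post, so the qmgr invocations are an assumption; the queue name is taken from the #PBS -q line of the job script, and kill_delay is given in seconds.

```shell
#!/bin/sh
# Assumed qmgr invocations; the post does not show the commands actually used.
QUEUE=verylong   # queue name from the "#PBS -q" line of jpl1.jb

if command -v qmgr >/dev/null 2>&1; then
    # Delay (in seconds) between TERM and KILL, set server-wide and per queue.
    qmgr -c "set server kill_delay = 300"
    qmgr -c "set queue $QUEUE kill_delay = 300"
    # Show what the server now reports for the attribute.
    qmgr -c "print server" | grep kill_delay || echo "kill_delay not set"
else
    echo "qmgr not found; run this on the pbs_server host"
fi
```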

jpl1.jb:

#!/bin/sh
#PBS -m n
#PBS -k n
#PBS -q verylong
#PBS -l nodes=node1
#PBS -o jpl1.OUT
#PBS -e jpl1.ERR

trap '' TERM INT
cd /home/jploski/torque
perl trap.pl


trap.pl:

#!/usr/bin/perl

use strict;
use warnings;

$| = 1;

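# Note: TERM and INT both go to TERM_handler, so either signal prints
# "Got TERM"; INT_handler below is never installed.
$SIG{'TERM'} = 'TERM_handler';
$SIG{'INT'} = 'TERM_handler';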

sub TERM_handler
{
        print "Got TERM\n";
        system("date");
        $SIG{'TERM'} = 'TERM_handler';
}

sub INT_handler
{
        print "Got INT\n";
        system("date");
        $SIG{'INT'} = 'TERM_handler';
}

for (my $i = 0; $i < 600; $i++)
{
        sleep(1);
}

Content of jpl1.OUT after the job is gone:

Got TERM
Fr Nov  9 09:47:25 CET 2007
Got TERM
Fr Nov  9 09:47:30 CET 2007
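
For reference, the behaviour asked for in the original question (TERM first, KILL only after a grace period) can be sketched in plain shell. This only illustrates the intended semantics and is not how TORQUE or Maui implement kill_delay. stop_with_grace is a hypothetical helper name, and it must be called from the shell that started the process because it relies on wait.

```shell
#!/bin/sh
# stop_with_grace PID SECONDS
# Hypothetical helper (not a TORQUE or Maui command): send TERM, then send
# KILL if the process is still running after the grace period.
stop_with_grace() {
    pid=$1
    grace=$2

    # Ask the process to shut down (for example, to write a checkpoint).
    kill -TERM "$pid" 2>/dev/null

    # Watchdog: force-kill the process once the grace period has elapsed.
    ( sleep "$grace"; kill -KILL "$pid" 2>/dev/null ) &
    watchdog=$!

    # Block until the process ends, whether by TERM or by KILL.
    wait "$pid"
    status=$?

    # Stop the watchdog if it is still pending. Its inner sleep may linger
    # until the grace period ends, but it can no longer send KILL.
    kill "$watchdog" 2>/dev/null
    wait "$watchdog" 2>/dev/null

    return "$status"
}
```

Called as stop_with_grace "$pid" 300, this would correspond to the five-minute grace period. With the usual shells on Linux, an exit status of 143 (128+15) means the process ended on TERM, and 137 (128+9) means it had to be killed.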

