[torqueusers] Re: Grace time for mjobctl -R
Jan Ploski
Jan.Ploski at offis.de
Fri Nov 9 01:52:56 MST 2007
mauiusers-bounces at supercluster.org wrote on 11/09/2007 12:55:36 AM:
> On Thu, Nov 08, 2007 at 07:08:48PM +0100, Jan Ploski alleged:
> > Hello,
> >
> > Is there a way to set a "grace time" for a job which is requeued with
> > mjobctl -R? I'd like mjobctl -R to send the job a TERM signal which
> > triggers the job to create a checkpoint. If the job does not terminate
> > for 5 minutes after that, I'd like it to be sent KILL. Right now, the
> > job is force-killed within just a few seconds.
>
> Using torque? See the kill_delay queue/server attribute.
Thanks. Yes, I'm using TORQUE 2.1.6. The attribute does not seem to help,
however: I set kill_delay = 300 on both the server and the queue and
restarted pbs_server afterwards. But when I qdel the following test job,
it is gone within a few seconds. Stdout shows that the TERM signal was
delivered twice (?). When I run jpl1.jb directly from the shell, it
catches the TERM and INT signals correctly...
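For reference, the kill_delay settings described above are normally applied through qmgr. A minimal sketch, assuming the standard qmgr syntax and the verylong queue named in the job script; the last command is only there to confirm that the values were stored:

```shell
# Assumed invocation, run as a TORQUE manager on the pbs_server host.
qmgr -c "set server kill_delay = 300"
qmgr -c "set queue verylong kill_delay = 300"

# Print the resulting server and queue configuration to check both attributes.
qmgr -c "print server"
```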
jpl1.jb:
#!/bin/sh
#PBS -m n
#PBS -k n
#PBS -q verylong
#PBS -l nodes=node1
#PBS -o jpl1.OUT
#PBS -e jpl1.ERR
# Ignore TERM and INT in the wrapper shell; trap.pl installs its own handlers.
trap '' TERM INT
cd /home/jploski/torque
perl trap.pl
trap.pl:
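#!/usr/bin/perl
use strict;
use warnings;

$| = 1;    # unbuffered stdout, so output reaches jpl1.OUT immediately

# As written, both TERM and INT are routed to TERM_handler, so a "Got TERM"
# line can come from either signal and INT_handler below is never installed.
$SIG{'TERM'} = 'TERM_handler';
$SIG{'INT'} = 'TERM_handler';

sub TERM_handler
{
    print "Got TERM\n";
    system("date");
    $SIG{'TERM'} = 'TERM_handler';
}

sub INT_handler
{
    print "Got INT\n";
    system("date");
    $SIG{'INT'} = 'TERM_handler';
}

# Stay alive for up to 10 minutes so there is time to deliver signals.
for (my $i = 0; $i < 600; $i++)
{
    sleep(1);
}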
Content of jpl1.OUT after the job is gone:
Got TERM
Fr Nov 9 09:47:25 CET 2007
Got TERM
Fr Nov 9 09:47:30 CET 2007
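
The behaviour being asked for (TERM first, KILL only after a grace period) can be reproduced locally, without TORQUE, to check that a job's signal handling works as intended. The following is only a sketch: term_then_kill is a hypothetical helper and not part of TORQUE or Maui, and the 5-second grace period stands in for the 300 seconds used above.

```shell
#!/bin/sh
# term_then_kill PID GRACE: hypothetical helper, not part of TORQUE or Maui.
# Sends TERM, waits up to GRACE seconds, then sends KILL if PID is still alive.
# Returns 0 if the process exited on its own, 1 if it had to be killed.
# PID must be a child of this shell, so that finished children get reaped
# and kill -0 stops succeeding.
term_then_kill() {
    tk_pid=$1
    tk_grace=$2
    tk_waited=0
    kill -TERM "$tk_pid" 2>/dev/null
    # Poll once per second until the process is gone or the grace period ends.
    while kill -0 "$tk_pid" 2>/dev/null && [ "$tk_waited" -lt "$tk_grace" ]; do
        sleep 1
        tk_waited=$((tk_waited + 1))
    done
    if kill -0 "$tk_pid" 2>/dev/null; then
        kill -KILL "$tk_pid"
        wait "$tk_pid" 2>/dev/null
        return 1
    fi
    return 0
}

# Demo: a child that catches TERM, pretends to checkpoint, and exits.
sh -c 'trap "echo Got TERM; exit 0" TERM; while :; do sleep 1; done' &
child=$!
sleep 1    # give the child time to install its trap
if term_then_kill "$child" 5; then
    echo "child exited within the grace period"
else
    echo "child was killed after the grace period"
fi
```

Replacing the demo child with a script that ignores TERM shows the other branch: the helper waits out the grace period and then sends KILL.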