[torquedev] preemption and dedicated nodes

Ken Nielson knielson at adaptivecomputing.com
Wed Jun 16 17:34:46 MDT 2010



----- Original Message -----
From: "Garrick Staples" <garrick at usc.edu>
To: torquedev at supercluster.org
Sent: Wednesday, June 16, 2010 4:49:58 PM
Subject: Re: [torquedev] preemption and dedicated nodes

>(Moving this from mauiuser to torquedev)

>On Thu, Jun 10, 2010 at 03:05:27PM -0700, Garrick Staples alleged:
>> Has anyone noticed that maui starts up preemptor jobs on top of preemptee jobs
>> on dedicated nodes?
>> 
>> With dedicated nodes, I expect that only one job will be running at a time. But
>> Maui starts the new job immediately after calling pbs_rerunjob(). The old job
>> hasn't had a chance to exit yet, so for a short time two jobs are running on
>> the node at once.

>Would anyone like my fix (so far)? I also fixed a second problem in torque,
>which is that preemptee jobs are simply SIGKILL'd instead of being given a
>SIGTERM followed by a kill_delay'd SIGKILL.

>This patch is against torque 2.1-fixes. The fix has two parts. The first
>replaces the SIGKILL with a SIGTERM plus a job_delete_nanny that obeys
>kill_delay, giving jobs a chance to checkpoint themselves.
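For anyone following the thread without the patch in hand, here is a minimal
standalone sketch (not the actual change) of the kill sequence being described:
SIGTERM first, then SIGKILL only after kill_delay seconds if the process is
still around. Inside pbs_server the delayed SIGKILL would come from a timed
job_delete_nanny work task rather than a blocking loop like this:

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

static void soft_kill(pid_t pid, int kill_delay)
  {
  if (kill(pid, SIGTERM) != 0)      /* ask the job to shut down cleanly */
    return;

  for (int i = 0; i < kill_delay; i++)
    {
    sleep(1);
    if (kill(pid, 0) != 0 && errno == ESRCH)
      return;                       /* no such process: the job is gone */
    }

  kill(pid, SIGKILL);               /* kill_delay expired: force it */
  }

int main(int argc, char *argv[])
  {
  if (argc != 3)
    {
    fprintf(stderr, "usage: %s <pid> <kill_delay_seconds>\n", argv[0]);
    return 1;
    }

  soft_kill((pid_t)atoi(argv[1]), atoi(argv[2]));
  return 0;
  }
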

>The second part is a hook near the top of req_runjob() that checks for existing
>jobs in the RERUN substate. If any are found, a work task is created to delay
>the job start; the caller (maui) is left waiting while this happens.
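And a schematic sketch of the shape of that req_runjob() hook. The job struct,
substate constant, and deferral helper below are invented stand-ins; the real
patch walks pbs_server's own job list, tests the actual RERUN substate value,
and queues a timed work task that leaves the scheduler's run request waiting:

#include <stddef.h>
#include <stdio.h>

#define SUBSTATE_RERUN     70   /* stand-in for the RERUN substate value */
#define RERUN_RETRY_DELAY   5   /* seconds before re-checking the job list */

typedef struct job
  {
  int         ji_substate;
  struct job *ji_next;
  } job;

/* Placeholder for the real deferral: in pbs_server this would be a timed
 * work task that re-dispatches the run request later, keeping the
 * scheduler's connection open in the meantime. */
static void defer_run_request(const char *reqid, int delay_secs)
  {
  printf("deferring run request %s for %d seconds\n", reqid, delay_secs);
  }

/* Returns 1 if the run request was deferred because some job is still in
 * the RERUN substate, 0 if it is safe to start the new job now. */
static int maybe_defer_runjob(const char *reqid, job *alljobs)
  {
  for (job *pj = alljobs; pj != NULL; pj = pj->ji_next)
    {
    if (pj->ji_substate == SUBSTATE_RERUN)
      {
      defer_run_request(reqid, RERUN_RETRY_DELAY);
      return 1;
      }
    }
  return 0;
  }

int main(void)
  {
  job preempted = { SUBSTATE_RERUN, NULL };  /* preemptee still being rerun */
  job other     = { 0, &preempted };

  if (!maybe_defer_runjob("42.server", &other))
    printf("starting job 42.server\n");
  return 0;
  }
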


>-- 
>Garrick Staples, GNU/Linux HPCC SysAdmin
>University of Southern California

We would certainly be interested.

Ken

_______________________________________________
torquedev mailing list
torquedev at supercluster.org
http://www.supercluster.org/mailman/listinfo/torquedev

