[torqueusers] job_nanny feature not working on torque 4.1.3

David Beer dbeer at adaptivecomputing.com
Wed Nov 21 10:43:08 MST 2012


Lech,

That is a bug. Does this patch fix the issue for you?
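
One way to check whether the duplicate mails stop after patching is to count the "preparing to send 'd' mail" entries per job in the pbs_server log. The following is only a minimal sketch: it assumes the entries look like the excerpt quoted below, and the function name count_delete_mails is made up for illustration, not part of torque.

```python
import re
from collections import Counter

# Matches the "preparing to send 'd' mail for job <id>" entries seen in the
# quoted log excerpt. Assumption: the job id follows the word "job", possibly
# on the next line if the text was wrapped and quoted with "> ".
MAIL_RE = re.compile(r"preparing to send 'd' mail for job[\s>]+(\S+)")


def count_delete_mails(log_text):
    """Return a Counter mapping job id -> number of 'd' mails prepared."""
    return Counter(MAIL_RE.findall(log_text))


if __name__ == "__main__":
    sample = (
        "11/15/2012 13:18:20  S    preparing to send 'd' mail for job "
        "670947.cheops10 to nierodal at cheops10\n"
        "11/15/2012 13:19:27  S    preparing to send 'd' mail for job "
        "670947.cheops10 to nierodal at cheops10\n"
    )
    for job_id, mails in count_delete_mails(sample).items():
        print(job_id, mails)
```

With a working job_nanny, each job should show up only once no matter how many deletion requests maui sends.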

David

On Thu, Nov 15, 2012 at 7:09 AM, Lech Nieroda <nieroda.lech at uni-koeln.de> wrote:

> Dear list,
>
> we've upgraded from torque 2.5.11 to torque 4.1.3 and have run into
> several problems. The most annoying one is that the job_nanny feature
> no longer works. We are using torque together with maui 3.3.1.
>
> To reproduce the problem we submit a job and as soon as it is running,
> kill the pbs_mom on the appropriate node with "kill -9" (the idea here
> is to simulate a node crash). Some time after the walltime is exceeded
> maui sends deletion requests to the pbs_server and a mail is spawned to
> the user. Since maui does this every 60 seconds, this adds up to a
> sizeable number of emails.
>
> On torque 2.5.x this was inhibited by the "job_nanny" feature: any
> further deletion requests from maui were met with a rejection ("job
> cancel in progress"). However, this doesn't work on torque 4.1.3
> anymore. The feature is set to "true" on the pbs_server but each one of
> maui's deletion requests triggers an email.
>
> We've tried to set "$ignwalltime true" on the clients, to no avail.
>
> Here are the relevant pbs_server logs with log_level 3:
>
> [snip]
> 11/15/2012 13:18:20  S    Job deleted at request of maui at localhost.localdomain
> 11/15/2012 13:18:20  S    preparing to send 'd' mail for job 670947.cheops10 to nierodal at cheops10 (Job deleted at request of maui at localhost.localdomain
> 11/15/2012 13:18:20  S    Job sent signal SIGTERM on delete
> 11/15/2012 13:19:27  S    Job deleted at request of maui at localhost.localdomain
> 11/15/2012 13:19:27  S    preparing to send 'd' mail for job 670947.cheops10 to nierodal at cheops10 (Job deleted at request of maui at localhost.localdomain
> 11/15/2012 13:19:27  S    Job sent signal SIGTERM on delete
> 11/15/2012 13:20:30  S    Job deleted at request of maui at localhost.localdomain
> 11/15/2012 13:20:30  S    preparing to send 'd' mail for job 670947.cheops10 to nierodal at cheops10 (Job deleted at request of maui at localhost.localdomain
> 11/15/2012 13:20:30  S    Job sent signal SIGTERM on delete
> [snap]
>
> Is this a bug? Can a parameter change this behaviour?
> So far, we've had to disable the mail functionality.
>
> Regards,
> Lech Nieroda
>
> PS: I've resent this mail since it didn't appear to have hit the list.
>
> --
> Dipl.-Wirt.-Inf. Lech Nieroda
> Regionales Rechenzentrum der Universität zu Köln (RRZK)
> Universität zu Köln
> Weyertal 121
> Raum 309 (3. Etage)
> D-50931 Köln
> Deutschland
>
> Tel.: +49 (221) 470-89606
> E-Mail: nieroda.lech at uni-koeln.de
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>



-- 
David Beer | Senior Software Engineer
Adaptive Computing
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nanny.patch
Type: application/octet-stream
Size: 407 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20121121/88f77c48/attachment.obj 

