[torqueusers] job_nanny feature not working on torque 4.1.3

Lech Nieroda nieroda.lech at uni-koeln.de
Thu Nov 22 08:45:36 MST 2012


Hello David,

Many thanks! The patch fixed the issue, and the job_nanny feature is
working now.

Regards,
Lech Nieroda

On 21.11.2012 18:43, David Beer wrote:
> Lech,
>
> That is a bug. Does this patch fix the issue for you?
>
> David
>
> On Thu, Nov 15, 2012 at 7:09 AM, Lech Nieroda
> <nieroda.lech at uni-koeln.de> wrote:
>
>     Dear list,
>
>     We've upgraded from torque 2.5.11 to torque 4.1.3 and have run into
>     several problems. The most annoying one is that the job_nanny
>     feature no longer works. We are using torque together with maui 3.3.1.
>
>     To reproduce the problem, we submit a job and, as soon as it is running,
>     kill the pbs_mom on the corresponding node with "kill -9" (the idea is
>     to simulate a node crash). Some time after the walltime is exceeded,
>     maui starts sending deletion requests to the pbs_server, and a mail is
>     sent to the user. Since maui repeats this every 60 seconds, this adds
>     up to a sizeable number of emails.
>
>     On torque 2.5.x this was prevented by the "job_nanny" feature: any
>     further deletion requests from maui were rejected ("job cancel in
>     progress"). However, this no longer works on torque 4.1.3. The
>     feature is set to "true" on the pbs_server, but every one of maui's
>     deletion requests still triggers an email.
>
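To make the expected behaviour concrete, here is a toy model of the difference described above. It is not Torque code: the class ToyServer, its fields, and the simulate helper are hypothetical names used only for illustration. The only details taken from the mail are the "job cancel in progress" rejection and the job ID.

```python
class ToyServer:
    """Toy model of the behaviour described in the mail; NOT Torque's implementation."""

    def __init__(self, job_nanny):
        self.job_nanny = job_nanny
        self.cancelling = set()  # jobs that already have a delete in progress
        self.mails_sent = 0

    def delete_job(self, job_id):
        # With job_nanny on, a repeated delete request is rejected
        # and therefore triggers no further mail.
        if self.job_nanny and job_id in self.cancelling:
            return "job cancel in progress"
        self.cancelling.add(job_id)
        self.mails_sent += 1  # every accepted delete request sends a mail
        return "accepted"


def simulate(job_nanny, requests):
    """Send `requests` delete requests for one job and return the mail count."""
    server = ToyServer(job_nanny)
    for _ in range(requests):
        server.delete_job("670947.cheops10")
    return server.mails_sent


if __name__ == "__main__":
    print("job_nanny honoured:", simulate(True, 3), "mail(s)")
    print("job_nanny ignored: ", simulate(False, 3), "mail(s)")
```

In this model the first case corresponds to what was seen on torque 2.5.x and the second to the behaviour reported after the upgrade.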
>     We've tried setting "$ignwalltime true" on the clients, to no avail.
>
>     Here are the relevant pbs_server logs with log_level 3:
>
>     [snip]
>     11/15/2012 13:18:20  S    Job deleted at request of
>     maui at localhost.localdomain
>     11/15/2012 13:18:20  S    preparing to send 'd' mail for job
>     670947.cheops10 to nierodal at cheops10 (Job deleted at request of
>     maui at localhost.localdomain
>     11/15/2012 13:18:20  S    Job sent signal SIGTERM on delete
>     11/15/2012 13:19:27  S    Job deleted at request of
>     maui at localhost.localdomain
>     11/15/2012 13:19:27  S    preparing to send 'd' mail for job
>     670947.cheops10 to nierodal at cheops10 (Job deleted at request of
>     maui at localhost.localdomain
>     11/15/2012 13:19:27  S    Job sent signal SIGTERM on delete
>     11/15/2012 13:20:30  S    Job deleted at request of
>     maui at localhost.localdomain
>     11/15/2012 13:20:30  S    preparing to send 'd' mail for job
>     670947.cheops10 to nierodal at cheops10 (Job deleted at request of
>     maui at localhost.localdomain
>     11/15/2012 13:20:30  S    Job sent signal SIGTERM on delete
>     [snap]
>
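The excerpt above can be checked mechanically. The following sketch counts the 'd' mail entries and the gaps between them. The log text is copied from the mail, including its line wrapping; the function names mail_events and gaps_seconds are made up for this example, and the timestamp pattern is an assumption based only on these few lines.

```python
import re
from datetime import datetime

# Excerpt copied from the quoted pbs_server log (wrapping as in the mail).
LOG_EXCERPT = """\
11/15/2012 13:18:20  S    Job deleted at request of
maui at localhost.localdomain
11/15/2012 13:18:20  S    preparing to send 'd' mail for job
670947.cheops10 to nierodal at cheops10 (Job deleted at request of
maui at localhost.localdomain
11/15/2012 13:18:20  S    Job sent signal SIGTERM on delete
11/15/2012 13:19:27  S    Job deleted at request of
maui at localhost.localdomain
11/15/2012 13:19:27  S    preparing to send 'd' mail for job
670947.cheops10 to nierodal at cheops10 (Job deleted at request of
maui at localhost.localdomain
11/15/2012 13:19:27  S    Job sent signal SIGTERM on delete
11/15/2012 13:20:30  S    Job deleted at request of
maui at localhost.localdomain
11/15/2012 13:20:30  S    preparing to send 'd' mail for job
670947.cheops10 to nierodal at cheops10 (Job deleted at request of
maui at localhost.localdomain
11/15/2012 13:20:30  S    Job sent signal SIGTERM on delete
"""

# Assumed entry layout: "MM/DD/YYYY HH:MM:SS  S    message"
STAMP = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+S\s+(.*)$")


def mail_events(text):
    """Return the timestamps of entries announcing a 'd' (delete) mail."""
    times = []
    for line in text.splitlines():
        m = STAMP.match(line)
        if m and m.group(2).startswith("preparing to send 'd' mail"):
            times.append(datetime.strptime(m.group(1), "%m/%d/%Y %H:%M:%S"))
    return times


def gaps_seconds(times):
    """Return the gaps in seconds between consecutive timestamps."""
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    times = mail_events(LOG_EXCERPT)
    print(len(times), "mails, gaps in seconds:", gaps_seconds(times))
```

For this excerpt the script finds three mail entries spaced 67 and 63 seconds apart, which fits the roughly 60-second maui cycle mentioned in the mail.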
>     Is this a bug? Can a parameter change this behaviour?
>     So far, we've had to disable the mail functionality.
>
>     Regards,
>     Lech Nieroda
>
>     PS: I'm resending this mail since the first copy didn't appear to
>     have reached the list.
>
>     --
>     Dipl.-Wirt.-Inf. Lech Nieroda
>     Regionales Rechenzentrum der Universität zu Köln (RRZK)
>     Universität zu Köln
>     Weyertal 121
>     Raum 309 (3. Etage)
>     D-50931 Köln
>     Deutschland
>
>     Tel.: +49 (221) 470-89606
>     E-Mail: nieroda.lech at uni-koeln.de
>     _______________________________________________
>     torqueusers mailing list
>     torqueusers at supercluster.org
>     http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>
>
> --
> David Beer | Senior Software Engineer
> Adaptive Computing
>
>


-- 
Dipl.-Wirt.-Inf. Lech Nieroda
Regionales Rechenzentrum der Universität zu Köln (RRZK)
Universität zu Köln
Weyertal 121
Raum 309 (3. Etage)
D-50931 Köln
Deutschland

Tel.: +49 (221) 470-89606
E-Mail: nieroda.lech at uni-koeln.de