[torqueusers] job_nanny feature not working on torque 4.1.3

Lech Nieroda nieroda.lech at uni-koeln.de
Thu Nov 15 07:09:11 MST 2012


Dear list,

we've upgraded from torque 2.5.11 to torque 4.1.3 and have run into
several problems. The most annoying one is that the job_nanny feature
no longer works. We are using torque together with maui 3.3.1.

To reproduce the problem, we submit a job and, as soon as it is running,
kill the pbs_mom on the corresponding node with "kill -9" (the idea here
is to simulate a node crash). Some time after the walltime is exceeded,
maui starts sending deletion requests to the pbs_server, and each request
triggers a mail to the user. Since maui repeats the request roughly every
60 seconds, this adds up to a sizeable number of emails.

On torque 2.5.x this was prevented by the "job_nanny" feature: any
further deletion requests from maui were rejected ("job cancel in
progress"). However, this no longer works on torque 4.1.3. The
feature is set to "true" on the pbs_server, but each one of maui's
deletion requests still triggers an email.

We've tried to set "$ignwalltime true" on the clients, to no avail.

Here are the relevant pbs_server logs with log_level 3:

[snip]
11/15/2012 13:18:20  S    Job deleted at request of maui at localhost.localdomain
11/15/2012 13:18:20  S    preparing to send 'd' mail for job 670947.cheops10 to nierodal at cheops10 (Job deleted at request of maui at localhost.localdomain
11/15/2012 13:18:20  S    Job sent signal SIGTERM on delete
11/15/2012 13:19:27  S    Job deleted at request of maui at localhost.localdomain
11/15/2012 13:19:27  S    preparing to send 'd' mail for job 670947.cheops10 to nierodal at cheops10 (Job deleted at request of maui at localhost.localdomain
11/15/2012 13:19:27  S    Job sent signal SIGTERM on delete
11/15/2012 13:20:30  S    Job deleted at request of maui at localhost.localdomain
11/15/2012 13:20:30  S    preparing to send 'd' mail for job 670947.cheops10 to nierodal at cheops10 (Job deleted at request of maui at localhost.localdomain
11/15/2012 13:20:30  S    Job sent signal SIGTERM on delete
[snap]
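
For anyone who wants to check the request interval in their own logs, here
is a minimal Python sketch. It assumes the log layout seen in the excerpt
(date, time, a one-letter field, then the message) with wrapped lines
joined; the names LOG_EXCERPT, deletion_times and intervals_seconds are
made up for this illustration and are not part of torque or maui.

```python
from datetime import datetime

# Deletion-related lines from the excerpt, with wrapped lines joined.
LOG_EXCERPT = """\
11/15/2012 13:18:20  S    Job deleted at request of maui at localhost.localdomain
11/15/2012 13:18:20  S    Job sent signal SIGTERM on delete
11/15/2012 13:19:27  S    Job deleted at request of maui at localhost.localdomain
11/15/2012 13:19:27  S    Job sent signal SIGTERM on delete
11/15/2012 13:20:30  S    Job deleted at request of maui at localhost.localdomain
11/15/2012 13:20:30  S    Job sent signal SIGTERM on delete
"""

MARKER = "Job deleted at request of"


def deletion_times(log_text):
    """Return the timestamps of all lines whose message starts with MARKER."""
    times = []
    for line in log_text.splitlines():
        parts = line.split(None, 3)  # date, time, type letter, message
        if len(parts) < 4 or not parts[3].startswith(MARKER):
            continue
        try:
            stamp = datetime.strptime(parts[0] + " " + parts[1],
                                      "%m/%d/%Y %H:%M:%S")
        except ValueError:
            continue  # not a timestamped line (e.g. a wrapped continuation)
        times.append(stamp)
    return times


def intervals_seconds(times):
    """Return the gaps between consecutive timestamps, in whole seconds."""
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    print(intervals_seconds(deletion_times(LOG_EXCERPT)))  # [67, 63]
```

On the excerpt above this gives gaps of 67 and 63 seconds, which matches
the roughly one-minute cycle in which maui repeats the deletion request.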

Is this a bug? Can a parameter change this behaviour?
So far, we've had to disable the mail functionality.

Regards,
Lech Nieroda

PS: I've resent this mail since it didn't appear to have hit the list.

-- 
Dipl.-Wirt.-Inf. Lech Nieroda
Regionales Rechenzentrum der Universität zu Köln (RRZK)
Universität zu Köln
Weyertal 121
Raum 309 (3. Etage)
D-50931 Köln
Deutschland

Tel.: +49 (221) 470-89606
E-Mail: nieroda.lech at uni-koeln.de

