[torqueusers] Cannot clear jobs with momctl

Sikora, Josef S josef.s.sikora at boeing.com
Mon Apr 19 09:08:51 MDT 2010


Try setting mom_job_sync on the server.  Below is the definition of the parameter:

   mom_job_sync
                 Enables the "job sync on MOM" feature. When MOMs send a status
                 update that includes a list of jobs, the server will issue job
                 deletes for any jobs that don't actually exist. Format: boolean;
                 default value: true.

With mom_job_sync enabled, the stale jobs are cleared and other jobs on the node are not affected.  We have been using this without any problems for over a year.
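
For reference, enabling it is a one-liner with qmgr on the pbs_server host (standard qmgr syntax; just make sure you run it with manager privileges):

    # Enable job sync on the server
    qmgr -c 'set server mom_job_sync = True'
    # Verify the setting
    qmgr -c 'print server' | grep mom_job_sync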

Josef



________________________________
From: torqueusers-bounces at supercluster.org On Behalf Of Paco Bernabé
Sent: Monday, April 19, 2010 12:46 AM
To: torqueusers at supercluster.org
Subject: Re: [torqueusers] Cannot clear jobs with momctl

Hi everyone,

Does anyone have an alternative strategy besides Marvin's?

Cheers,
Paco.


On Apr 15, 2010, at 5:13 PM, Paco Bernabé wrote:


Hi Marvin,

I couldn't do that, as it's a heavily loaded production system and jobs arrive at the WNs all the time. I would need a solution that affects only the stale jobs, not the entire worker node.

Cheers,
Paco.

On Apr 15, 2010, at 3:12 PM, Marvin Novaglobal wrote:


Hi Paco,
    You're right. It is always safer to set the node to the offline state before clearing all stale jobs. In my case, though, I just make sure there is no job registered on the execution node at the server side, and then I clear all the stale jobs.
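
For what it's worth, the safer variant boils down to three commands ('pbsnodes -o' and 'pbsnodes -c' are the standard offline/online flags; the sequence itself is just a sketch):

    pbsnodes -o $wn          # mark the node offline so no new jobs land on it
    momctl -h $wn -c all     # clear all stale jobs known to the MOM
    pbsnodes -c $wn          # bring the node back online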


Regards,
Marvin

On Wed, Apr 14, 2010 at 5:03 PM, Paco Bernabé <fbernabe at nikhef.nl> wrote:
Hi Marvin,

Thanks for your reply; this actually works, but in order to execute 'momctl -h $wn -c all' I have to set the node to 'offline' in advance, so that no new jobs arrive on it. Do you know of possible reasons for the jobs to get stuck when the status is EXITED? Is there anything relevant I could look for in the log files? Is there any other strategy that doesn't require setting the node offline?

Thanks,
Paco.


On Apr 14, 2010, at 8:01 AM, Marvin Novaglobal wrote:


Hi,
    Perhaps you can use 'pbsnodes $wn' and grep to check whether there is a job registered as running on the compute node. Then, if no running job is registered on the pbs_server side, use 'momctl -c ALL' to clear all the stale jobs. Optionally, you can recycle the pbs_mom as well. So far, this has served us well.
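
Roughly, that check-then-clear flow could look like the sketch below (the test on the 'jobs =' line is an assumption about the pbsnodes output format; adapt it to what your nodes actually report):

    # proceed only if pbs_server shows no jobs assigned to this node
    if ! pbsnodes $wn | grep -q 'jobs ='; then
        momctl -h $wn -c all     # clear all stale jobs on the MOM
        # optionally restart the MOM too, e.g. via its init script:
        # service pbs_mom restart
    fi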


Regards,
Marvin

On Tue, Apr 13, 2010 at 10:01 PM, Paco Bernabé <fbernabe at nikhef.nl> wrote:
Hi,

I run 'momctl -d2 -h $wn' daily (via cron) to detect jobs that have got stuck. The Torque server runs Torque/Maui version 2.3.8 under CentOS 5.4. I wrote a small script that detects the jobs that are not cleared automatically by the Torque server and clears them with 'momctl -h $wn -c $job_id'. So far I've seen that these jobs have 'state' set to either PREOBIT or EXITED. In the first case (first example below) a SIGKILL signal is eventually sent by the Torque server; the script detects this by running 'tracejob -n 30 -q $job_id' and then clears the job via momctl. In the second case (second and third examples below) I've tried several times to clear the jobs via momctl, without success.
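
In case it helps, the script does essentially the following (a simplified sketch, not the production version; it assumes the "job[<id>] state=<STATE>" lines shown in the momctl output below):

    #!/bin/sh
    # Sweep one worker node for stuck jobs; the real script also
    # inspects 'tracejob -n 30 -q $job_id' before clearing PREOBIT jobs.
    wn=$1
    momctl -d2 -h "$wn" |
    awk -F'[][]' '/state=(PREOBIT|EXITED)/ { print $2 }' |
    while read job_id; do
        # only clear jobs the server no longer knows about (qdel is useless)
        if ! qstat "$job_id" >/dev/null 2>&1; then
            echo "clearing stale job $job_id on $wn"
            momctl -h "$wn" -c "$job_id"
        fi
    done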

After talking to some colleagues, one solution would be to stop the MOM, remove the related files inside /var/spool/pbs/ and in /tmp/jobdir, and start the MOM again; but it would be great to find a better solution, as the system is in production. By the way, these jobs are no longer in the queue, so I cannot use qdel.
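
The manual procedure would be roughly the following (the job-file locations under /var/spool/pbs/ and the init script name are assumptions based on a default Torque layout, so double-check them before removing anything):

    # run on the worker node, once per stale job
    job=3968121                                    # hypothetical job number
    service pbs_mom stop
    rm -f /var/spool/pbs/mom_priv/jobs/${job}.*    # MOM's job state files
    rm -rf /tmp/jobdir/${job}*                     # per-job scratch files
    service pbs_mom start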

So my questions are:

    1.- Is there an alternative strategy to clear the jobs, besides momctl and restarting the MOM?
    2.- Are there other cases where jobs get stuck? If so, what is the strategy to clear them?

If more information is required, please let me know.

Cheers,
Paco.


Host: <WORKER_NODE>
Version: 2.3.8
PID: 21542
Server[0]: <TORQUE_SERVER> (<IP>:15001)
Init Msgs Received: 0 hellos/1 cluster-addrs
Init Msgs Sent: 1 hellos
Last Msg From Server: 15 seconds (StatusJob)
Last Msg To Server: 22 seconds
HomeDirectory: /var/spool/pbs/mom_priv
stdout/stderr spool directory: '/var/spool/pbs/spool/' (1109151 blocks available)
MOM active: 1244335 seconds
Check Poll Time: 45 seconds
Server Update Interval: 45 seconds
LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model: RPP
MemLocked: TRUE (mlock)
TCP Timeout: 20 seconds
Prolog: /var/spool/pbs/mom_priv/prologue (disabled)
Alarm Time: 0 of 10 seconds
Trusted Client List: <LIST>
Copy Command: /usr/bin/scp -rpB
job[3657754.<TORQUE_SERVER>] state=RUNNING sidlist=29714
job[3659174.<TORQUE_SERVER>] state=RUNNING sidlist=14994
job[3662531.<TORQUE_SERVER>] state=RUNNING sidlist=10682
job[3665186.<TORQUE_SERVER>] state=RUNNING sidlist=30058
job[3665605.<TORQUE_SERVER>] state=RUNNING sidlist=26822
job[3666248.<TORQUE_SERVER>] state=RUNNING sidlist=31058
job[3667022.<TORQUE_SERVER>] state=RUNNING sidlist=8774
job[3667269.<TORQUE_SERVER>] state=PREOBIT sidlist=
Assigned CPU Count: 8
diagnostics complete




Host: <WORKER_NODE>
Version: 2.3.8
PID: 1036
Server[0]: <TORQUE_SERVER> (<IP>:15001)
Init Msgs Received: 0 hellos/1 cluster-addrs
Init Msgs Sent: 1 hellos
WARNING: invalid attempt to connect from server 127.0.0.1:1021 (request corrupt)
Last Msg From Server: 66 seconds (StatusJob)
Last Msg To Server: 12 seconds
HomeDirectory: /var/spool/pbs/mom_priv
stdout/stderr spool directory: '/var/spool/pbs/spool/' (1803724 blocks available)
MOM active: 2704397 seconds
Check Poll Time: 45 seconds
Server Update Interval: 45 seconds
LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model: RPP
MemLocked: TRUE (mlock)
TCP Timeout: 20 seconds
Prolog: /var/spool/pbs/mom_priv/prologue (disabled)
Alarm Time: 0 of 10 seconds
Trusted Client List: <LIST>
Copy Command: /usr/bin/scp -rpB
job[3968121.stro.nikhef.nl] state=EXITED sidlist=
job[4433586.stro.nikhef.nl] state=RUNNING sidlist=5873
job[4435775.stro.nikhef.nl] state=RUNNING sidlist=19862
Assigned CPU Count: 3
diagnostics complete




Host: <WORKER_NODE>
Version: 2.3.8
PID: 8134
Server[0]: <TORQUE_SERVER> (<IP>:15001)
Init Msgs Received: 0 hellos/1 cluster-addrs
Init Msgs Sent: 1 hellos
Last Msg From Server: 90 seconds (ModifyJob)
Last Msg To Server: 38 seconds
HomeDirectory: /var/spool/pbs/mom_priv
stdout/stderr spool directory: '/var/spool/pbs/spool/' (1800265 blocks available)
MOM active: 2704367 seconds
Check Poll Time: 45 seconds
Server Update Interval: 45 seconds
LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model: RPP
MemLocked: TRUE (mlock)
TCP Timeout: 20 seconds
Prolog: /var/spool/pbs/mom_priv/prologue (disabled)
Alarm Time: 0 of 10 seconds
Trusted Client List: <LIST>
Copy Command: /usr/bin/scp -rpB
job[3968769.<TORQUE_SERVER>] state=EXITED sidlist=
job[4442748.<TORQUE_SERVER>] state=RUNNING sidlist=7387
job[4462579.<TORQUE_SERVER>] state=RUNNING sidlist=30248
Assigned CPU Count: 3
diagnostics complete



============================================
F.J. Bernabé Pellicer

Nikhef, Dutch National Institute for Sub-atomic Physics
Group Computer Technology
Room: H154
Phone: +31 20 592 2185
Science Park 105
1098 XG Amsterdam
The Netherlands

