[torqueusers] Cannot clear jobs with momctl

Paco Bernabé fbernabe at nikhef.nl
Mon Apr 19 01:45:34 MDT 2010


Hi everyone,

Does anyone have an alternative strategy besides Marvin's?

Cheers,
Paco.


On Apr 15, 2010, at 5:13 PM, Paco Bernabé wrote:

> Hi Marvin,
> 
> I couldn't do that, as it's a heavily loaded production system and jobs arrive at the WNs all the time. I would need a solution that affects only the stale jobs and not the entire worker node.
> 
> Cheers,
> Paco.
> 
> On Apr 15, 2010, at 3:12 PM, Marvin Novaglobal wrote:
> 
>> Hi Paco,
>>     You're right, it is always safer to set the node offline before clearing all stale jobs. In my case, though, I just make sure there is no job registered on the execution node at the server side, and then I clear all the stale jobs.
>> 
>> 
>> Regards,
>> Marvin
>> 
>> 
>> On Wed, Apr 14, 2010 at 5:03 PM, Paco Bernabé <fbernabe at nikhef.nl> wrote:
>> Hi Marvin,
>> 
>> Thanks for your reply, this actually works; but in order to execute 'momctl -h $wn -c all', I have to set the node 'offline' in advance, so that no new jobs land on it. Do you know of possible reasons for jobs to get stuck when the status is EXITED? Is there anything relevant I could look for in the log files? And is there any other strategy that doesn't require setting the node offline?
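>> 
>> For reference, the workflow I use now is roughly the following (just a sketch; the -o/-c flags are the offline/clear options as I read them in the TORQUE 2.3 man pages):
>> 
>>     pbsnodes -o $wn          # mark the node offline so no new jobs are scheduled on it
>>     momctl -h $wn -c all     # clear the stale jobs on the node
>>     pbsnodes -c $wn          # clear the offline state and put the node back in service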
>> 
>> Thanks,
>> Paco.
>> 
>> 
>> On Apr 14, 2010, at 8:01 AM, Marvin Novaglobal wrote:
>> 
>>> Hi,
>>>     Perhaps you can use 'pbsnodes $wn' and grep to check whether there is a job registered as running on the compute node. Then, if no running job is registered on the pbs_server side, use 'momctl -c ALL' to clear all the stale jobs. Optionally, you can recycle the pbs_mom as well. So far this has served us well.
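>>> 
>>> Something along these lines (an untested sketch; the 'jobs = ' pattern and the pbs_mom init-script name are assumptions, adjust them to your installation):
>>> 
>>>     # only touch the mom if the server no longer lists any job on the node
>>>     if ! pbsnodes $wn | grep -q 'jobs = '; then
>>>         momctl -h $wn -c all              # clear all stale jobs on the mom
>>>         ssh $wn service pbs_mom restart   # optional: recycle the mom (script name may differ)
>>>     fi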
>>> 
>>> 
>>> Regards,
>>> Marvin
>>> 
>>> 
>>> On Tue, Apr 13, 2010 at 10:01 PM, Paco Bernabé <fbernabe at nikhef.nl> wrote:
>>> Hi,
>>> 
>>> I run 'momctl -d2 -h $wn' daily via cron in order to detect jobs that have become stuck. The server runs Torque/Maui 2.3.8 under CentOS 5.4. I wrote a small script that detects the jobs that won't be cleared automatically by the Torque server and clears them with 'momctl -h $wn -c $job_id'. So far I've seen that such jobs have 'state' set to either PREOBIT or EXITED. In the first case (first example below) a SIGKILL signal is eventually sent by the Torque server; the script detects this by running 'tracejob -n 30 -q $job_id' and then clears the job via momctl. In the second case (second and third examples below) I've tried several times to clear the jobs via momctl, without success.
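>>> 
>>> In outline the script does something like this (a simplified sketch, not the exact production script; in particular, the pattern used to spot the SIGKILL in the tracejob output depends on what your logs contain):
>>> 
>>>     #!/bin/bash
>>>     # $wn is the worker node being checked
>>>     for job in $(momctl -d2 -h $wn | awk '/state=(PREOBIT|EXITED)/ { gsub(/job\[|\]/, "", $1); print $1 }'); do
>>>         # clear the job only once the server has already tried to kill it
>>>         if tracejob -n 30 -q $job | grep -qi kill; then
>>>             momctl -h $wn -c $job
>>>         fi
>>>     done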
>>> 
>>> After talking to some colleagues, one workaround would be to stop the mom, remove the related files under /var/spool/pbs/ and in /tmp/jobdir, and start the mom again; but it would be great to find a better solution, as the system is in production. By the way, none of these jobs are in the queue anymore, so I cannot use qdel.
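>>> 
>>> Spelled out, that workaround would be roughly the following (a sketch; the mom_priv/jobs path and the pbs_mom init-script name are my assumptions, and the layout of /tmp/jobdir is site-specific):
>>> 
>>>     service pbs_mom stop
>>>     rm -f /var/spool/pbs/mom_priv/jobs/${job_id}.*   # files the mom keeps for the stale job
>>>     rm -rf /tmp/jobdir/${job_id}*                    # per-job scratch area mentioned above
>>>     service pbs_mom start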
>>> 
>>> So my questions are:
>>> 
>>>     1.- Is there any alternative strategy to clear the jobs, besides momctl and restarting the mom?
>>>     2.- Are there other cases where jobs get stuck? If so, what is the strategy to clear them?
>>> 
>>> If more information is required, please let me know.
>>> 
>>> Cheers,
>>> Paco.
>>> 
>>> 
>>> Host: <WORKER_NODE> 
>>> Version: 2.3.8 
>>> PID: 21542 
>>> Server[0]: <TORQUE_SERVER> (<IP>:15001) 
>>> Init Msgs Received: 0 hellos/1 cluster-addrs 
>>> Init Msgs Sent: 1 hellos 
>>> Last Msg From Server: 15 seconds (StatusJob) 
>>> Last Msg To Server: 22 seconds 
>>> HomeDirectory: /var/spool/pbs/mom_priv 
>>> stdout/stderr spool directory: '/var/spool/pbs/spool/' (1109151 blocks available) 
>>> MOM active: 1244335 seconds 
>>> Check Poll Time: 45 seconds 
>>> Server Update Interval: 45 seconds 
>>> LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust) 
>>> Communication Model: RPP 
>>> MemLocked: TRUE (mlock) 
>>> TCP Timeout: 20 seconds 
>>> Prolog: /var/spool/pbs/mom_priv/prologue (disabled) 
>>> Alarm Time: 0 of 10 seconds 
>>> Trusted Client List: <LIST>
>>> Copy Command: /usr/bin/scp -rpB 
>>> job[3657754.<TORQUE_SERVER>] state=RUNNING sidlist=29714 
>>> job[3659174.<TORQUE_SERVER>] state=RUNNING sidlist=14994 
>>> job[3662531.<TORQUE_SERVER>] state=RUNNING sidlist=10682 
>>> job[3665186.<TORQUE_SERVER>] state=RUNNING sidlist=30058 
>>> job[3665605.<TORQUE_SERVER>] state=RUNNING sidlist=26822 
>>> job[3666248.<TORQUE_SERVER>] state=RUNNING sidlist=31058 
>>> job[3667022.<TORQUE_SERVER>] state=RUNNING sidlist=8774 
>>> job[3667269.<TORQUE_SERVER>] state=PREOBIT sidlist= 
>>> Assigned CPU Count: 8 
>>> diagnostics complete
>>> 
>>> 
>>> 
>>> 
>>> Host: <WORKER_NODE> 
>>> Version: 2.3.8 
>>> PID: 1036 
>>> Server[0]: <TORQUE_SERVER> (<IP>:15001) 
>>> Init Msgs Received: 0 hellos/1 cluster-addrs 
>>> Init Msgs Sent: 1 hellos 
>>> WARNING: invalid attempt to connect from server 127.0.0.1:1021 (request corrupt) 
>>> Last Msg From Server: 66 seconds (StatusJob) 
>>> Last Msg To Server: 12 seconds 
>>> HomeDirectory: /var/spool/pbs/mom_priv 
>>> stdout/stderr spool directory: '/var/spool/pbs/spool/' (1803724 blocks available) 
>>> MOM active: 2704397 seconds 
>>> Check Poll Time: 45 seconds 
>>> Server Update Interval: 45 seconds 
>>> LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust) 
>>> Communication Model: RPP 
>>> MemLocked: TRUE (mlock) 
>>> TCP Timeout: 20 seconds 
>>> Prolog: /var/spool/pbs/mom_priv/prologue (disabled) 
>>> Alarm Time: 0 of 10 seconds 
>>> Trusted Client List: <LIST>
>>> Copy Command: /usr/bin/scp -rpB 
>>> job[3968121.stro.nikhef.nl] state=EXITED sidlist= 
>>> job[4433586.stro.nikhef.nl] state=RUNNING sidlist=5873 
>>> job[4435775.stro.nikhef.nl] state=RUNNING sidlist=19862 
>>> Assigned CPU Count: 3 
>>> diagnostics complete
>>> 
>>> 
>>> 
>>> 
>>> Host: <WORKER_NODE> 
>>> Version: 2.3.8 
>>> PID: 8134 
>>> Server[0]: <TORQUE_SERVER> (<IP>:15001) 
>>> Init Msgs Received: 0 hellos/1 cluster-addrs 
>>> Init Msgs Sent: 1 hellos 
>>> Last Msg From Server: 90 seconds (ModifyJob) 
>>> Last Msg To Server: 38 seconds 
>>> HomeDirectory: /var/spool/pbs/mom_priv 
>>> stdout/stderr spool directory: '/var/spool/pbs/spool/' (1800265 blocks available) 
>>> MOM active: 2704367 seconds 
>>> Check Poll Time: 45 seconds 
>>> Server Update Interval: 45 seconds 
>>> LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust) 
>>> Communication Model: RPP 
>>> MemLocked: TRUE (mlock) 
>>> TCP Timeout: 20 seconds 
>>> Prolog: /var/spool/pbs/mom_priv/prologue (disabled) 
>>> Alarm Time: 0 of 10 seconds 
>>> Trusted Client List: <LIST>
>>> Copy Command: /usr/bin/scp -rpB 
>>> job[3968769.<TORQUE_SERVER>] state=EXITED sidlist= 
>>> job[4442748.<TORQUE_SERVER>] state=RUNNING sidlist=7387 
>>> job[4462579.<TORQUE_SERVER>] state=RUNNING sidlist=30248 
>>> Assigned CPU Count: 3 
>>> diagnostics complete
>>> 
>>> 
>>> 
>> 
> 

============================================
F.J. Bernabé Pellicer

Nikhef, Dutch National Institute for Sub-atomic Physics
Group Computer Technology
Room: H154
Phone: +31 20 592 2185
Science Park 105
1098 XG Amsterdam
The Netherlands
