[torqueusers] Cannot clear jobs with momctl

Paco Bernabé fbernabe at nikhef.nl
Tue Apr 13 08:01:02 MDT 2010


Hi,

Every day I run 'momctl -d2 -h $wn' from cron to detect jobs that have got stuck. The Torque server runs version 2.3.8 of Torque/Maui under CentOS 5.4. I wrote a small script that detects jobs that would not be cleared automatically by the Torque server and clears them with 'momctl -h $wn -c $job_id'. So far, these jobs have always had 'state' set to either PREOBIT or EXITED. In the first case (first example below), the Torque server eventually sends a SIGKILL signal; the script detects this by running 'tracejob -n 30 -q $job_id' and then clears the job via momctl. In the second case (second and third examples below), I have tried several times to clear the jobs via momctl, without success.
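
For readers who want to reproduce the detection step, here is a minimal sketch of how the 'job[...]' lines in the momctl diagnostics could be filtered. It is not the script mentioned above. The function name, the regular expression and the choice to flag only jobs with an empty sidlist are assumptions for illustration, based on the output format shown in the examples further down.

```python
import re

# Matches lines such as: job[3667269.<TORQUE_SERVER>] state=PREOBIT sidlist=
JOB_LINE = re.compile(
    r"^job\[(?P<job_id>[^\]]+)\]\s+state=(?P<state>\S+)\s+sidlist=(?P<sids>.*)$"
)

# States reported as stuck in this post; adjust to what you observe locally.
SUSPECT_STATES = {"PREOBIT", "EXITED"}


def find_suspect_jobs(momctl_output):
    """Return (job_id, state) pairs for jobs in a suspect state with no sessions."""
    suspects = []
    for line in momctl_output.splitlines():
        match = JOB_LINE.match(line.strip())
        if not match:
            continue  # not a job line (header, warning, blank line, ...)
        has_sessions = bool(match.group("sids").strip())
        if match.group("state") in SUSPECT_STATES and not has_sessions:
            suspects.append((match.group("job_id"), match.group("state")))
    return suspects


# Two lines copied from the first example in this post.
sample = """\
job[3667022.<TORQUE_SERVER>] state=RUNNING sidlist=8774
job[3667269.<TORQUE_SERVER>] state=PREOBIT sidlist=
"""
print(find_suspect_jobs(sample))
```

In practice the text would come from the output of 'momctl -d2 -h $wn'. A job exiting normally may show one of these states for a short time, so a real script should probably only act on jobs that stay flagged across several runs.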

After talking to some colleagues, one solution would be to stop the mom, remove the related files under /var/spool/pbs/ and /tmp/jobdir, and start the mom again. However, it would be great to find a better solution, as the system is in production. By the way, none of these jobs is in the queue anymore, so I cannot use qdel.
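
Before removing anything by hand, it may help to list which files would be affected. The sketch below only lists candidates and deletes nothing. It assumes that the files and directories belonging to a job have names starting with the job's numeric sequence number followed by a dot; that assumption should be checked on the worker node first. The function name and the file names in the demo are made up for illustration.

```python
import os
import tempfile


def list_job_files(job_id, roots):
    """Return paths under roots whose name starts with the job's sequence number.

    Assumption: job-related entries are named '<sequence number>.<something>'.
    Nothing is deleted here; the result is only a list for manual review.
    """
    prefix = job_id.split(".")[0] + "."  # '3968121.server' -> '3968121.'
    matches = []
    for root in roots:
        # os.walk yields nothing for a root that does not exist.
        for dirpath, dirnames, filenames in os.walk(root):
            for name in dirnames + filenames:
                if name.startswith(prefix):
                    matches.append(os.path.join(dirpath, name))
    return sorted(matches)


# Demo on a temporary directory with hypothetical file names.
with tempfile.TemporaryDirectory() as root:
    for name in ("3968121.stro.JB", "4433586.stro.JB"):
        open(os.path.join(root, name), "w").close()
    found = list_job_files("3968121.stro.nikhef.nl", [root])
    print([os.path.basename(path) for path in found])
```

On a worker node the roots would be the two locations mentioned above, /var/spool/pbs/ and /tmp/jobdir, and the list should be reviewed with the mom stopped before anything is removed.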

So my questions are:

    1.- Is there any alternative way to clear these jobs, besides momctl and restarting the mom?
    2.- Are there other cases in which jobs get stuck? If so, what is the strategy to clear them?

If more information is required, please let me know.

Cheers,
Paco.


Host: <WORKER_NODE> 
Version: 2.3.8 
PID: 21542 
Server[0]: <TORQUE_SERVER> (<IP>:15001) 
Init Msgs Received: 0 hellos/1 cluster-addrs 
Init Msgs Sent: 1 hellos 
Last Msg From Server: 15 seconds (StatusJob) 
Last Msg To Server: 22 seconds 
HomeDirectory: /var/spool/pbs/mom_priv
stdout/stderr spool directory: '/var/spool/pbs/spool/' (1109151 blocks available)
MOM active: 1244335 seconds 
Check Poll Time: 45 seconds 
Server Update Interval: 45 seconds 
LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust) 
Communication Model: RPP 
MemLocked: TRUE (mlock) 
TCP Timeout: 20 seconds 
Prolog: /var/spool/pbs/mom_priv/prologue (disabled) 
Alarm Time: 0 of 10 seconds 
Trusted Client List: <LIST>
Copy Command: /usr/bin/scp -rpB 
job[3657754.<TORQUE_SERVER>] state=RUNNING sidlist=29714 
job[3659174.<TORQUE_SERVER>] state=RUNNING sidlist=14994 
job[3662531.<TORQUE_SERVER>] state=RUNNING sidlist=10682 
job[3665186.<TORQUE_SERVER>] state=RUNNING sidlist=30058 
job[3665605.<TORQUE_SERVER>] state=RUNNING sidlist=26822 
job[3666248.<TORQUE_SERVER>] state=RUNNING sidlist=31058 
job[3667022.<TORQUE_SERVER>] state=RUNNING sidlist=8774 
job[3667269.<TORQUE_SERVER>] state=PREOBIT sidlist= 
Assigned CPU Count: 8
diagnostics complete




Host: <WORKER_NODE> 
Version: 2.3.8 
PID: 1036 
Server[0]: <TORQUE_SERVER> (<IP>:15001) 
Init Msgs Received: 0 hellos/1 cluster-addrs 
Init Msgs Sent: 1 hellos 
WARNING: invalid attempt to connect from server 127.0.0.1:1021 (request corrupt) 
Last Msg From Server: 66 seconds (StatusJob) 
Last Msg To Server: 12 seconds 
HomeDirectory: /var/spool/pbs/mom_priv
stdout/stderr spool directory: '/var/spool/pbs/spool/' (1803724 blocks available)
MOM active: 2704397 seconds 
Check Poll Time: 45 seconds 
Server Update Interval: 45 seconds 
LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust) 
Communication Model: RPP 
MemLocked: TRUE (mlock) 
TCP Timeout: 20 seconds 
Prolog: /var/spool/pbs/mom_priv/prologue (disabled) 
Alarm Time: 0 of 10 seconds 
Trusted Client List: <LIST>
Copy Command: /usr/bin/scp -rpB 
job[3968121.stro.nikhef.nl] state=EXITED sidlist= 
job[4433586.stro.nikhef.nl] state=RUNNING sidlist=5873 
job[4435775.stro.nikhef.nl] state=RUNNING sidlist=19862 
Assigned CPU Count: 3
diagnostics complete




Host: <WORKER_NODE> 
Version: 2.3.8 
PID: 8134 
Server[0]: <TORQUE_SERVER> (<IP>:15001) 
Init Msgs Received: 0 hellos/1 cluster-addrs 
Init Msgs Sent: 1 hellos 
Last Msg From Server: 90 seconds (ModifyJob) 
Last Msg To Server: 38 seconds 
HomeDirectory: /var/spool/pbs/mom_priv
stdout/stderr spool directory: '/var/spool/pbs/spool/' (1800265 blocks available)
MOM active: 2704367 seconds 
Check Poll Time: 45 seconds 
Server Update Interval: 45 seconds 
LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust) 
Communication Model: RPP 
MemLocked: TRUE (mlock) 
TCP Timeout: 20 seconds 
Prolog: /var/spool/pbs/mom_priv/prologue (disabled)
Alarm Time: 0 of 10 seconds
Trusted Client List: <LIST>
Copy Command: /usr/bin/scp -rpB 
job[3968769.<TORQUE_SERVER>] state=EXITED sidlist= 
job[4442748.<TORQUE_SERVER>] state=RUNNING sidlist=7387 
job[4462579.<TORQUE_SERVER>] state=RUNNING sidlist=30248 
Assigned CPU Count: 3
diagnostics complete



============================================
F.J. Bernabé Pellicer

Nikhef, Dutch National Institute for Sub-atomic Physics
Group Computer Technology
Room: H154
Phone: +31 20 592 2185
Science Park 105
1098 XG Amsterdam
The Netherlands
