[torqueusers] qdel will not delete

Greenseid, Joseph M. Joseph.Greenseid at ngc.com
Thu Dec 11 12:19:26 MST 2008


I've only seen this problem when some of the nodes allocated to the job are unresponsive (either because they've crashed or because they're so overloaded that they're effectively unresponsive).  The job will not be able to exit until the mom can communicate with the unresponsive node again (unless you force it, as Steve mentions below).
 
--Joe

________________________________

From: torqueusers-bounces at supercluster.org on behalf of Steve Young
Sent: Thu 12/11/2008 2:02 PM
To: Rahul Nabar
Cc: torqueusers at supercluster.org
Subject: Re: [torqueusers] qdel will not delete



Usually when this happens, qdel -p <job id> will remove the job from
the queue if a normal qdel won't do it. From the qdel man page:

        -p     Forcibly purge the job from the server.  This should only
               be used if a running job will not exit because its
               allocated nodes are unreachable.  The admin should make
               every attempt at resolving the problem on the nodes.  If
               a job's mother superior recovers after purging the job,
               any epilogue scripts may still run.  This option is only
               available to a batch operator or the batch administrator.
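
For anyone who wants to script this, here is a rough sketch in Python of "try a normal qdel, then purge only if the job is still there". It is only an illustration under some assumptions: the helper names (job_still_listed, delete_job), the 60-second wait, and the idea that qstat exits non-zero once the server no longer knows the job are mine, not from the man page. A job kept around in the completed state would also count as "still listed", so adjust the check for your site.

```python
import subprocess
import time


def job_still_listed(job_id):
    """Return True if qstat still reports the job.

    Assumption: qstat exits non-zero once the server no longer knows the job.
    """
    status = subprocess.call(
        ["qstat", job_id],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return status == 0


def delete_job(job_id, run=subprocess.call, listed=job_still_listed,
               wait_seconds=60, sleep=time.sleep):
    """Issue a normal qdel; fall back to qdel -p only if the job remains.

    run, listed, and sleep are injectable so the logic can be tried
    without a live cluster. Returns "deleted" or "purged".
    """
    run(["qdel", job_id])
    sleep(wait_seconds)  # give the moms a chance to report back
    if not listed(job_id):
        return "deleted"
    # Forced purge: per the man page this needs operator or
    # administrator privileges, and the nodes should be checked first.
    run(["qdel", "-p", job_id])
    return "purged"
```

Calling delete_job("<job id>") as a batch operator or administrator would run the real commands; passing stub callables for run and listed lets you check the flow on a workstation first.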

Hope this helps,

-Steve

On Dec 11, 2008, at 1:47 PM, Rahul Nabar wrote:

> I've had jobs that won't respond to qdel once every so often. Their
> "REMAINING-time" on MAUI then becomes negative which was initially
> confusing since I thought it was a MAUI bug.
>
> But the root cause seems to be that PBS will not obey the qdel on this
> job, irrespective of whether I issue it as root or MAUI issues it.
>
> I had one such job today and I debugged it more: all the sub-nodes
> seemed to be up, and the mom daemon on each one of these nodes seemed
> to be up and running.
>
> The mom_log on the master node, though, was interesting. It had this
> snippet:
>
> 12/11/2008 11:47:38;0002;   pbs_mom;Svr;im_request;connect from 11.0.1.79:1023
> 12/11/2008 11:47:38;0008;   pbs_mom;Job;233139.supernova.che.wisc.edu;received request 'KILL_JOB' from 11.0.1.79:1023
> 12/11/2008 11:47:38;0008;   pbs_mom;Job;233139.supernova.che.wisc.edu;ERROR:    received request 'KILL_JOB' from 11.0.1.79:1023 for job '233139.supernova.che.wisc.edu' (job does not exist locally)
>
> The only way I could get this job to delete was to restart the pbs_mom
> on that node.
>
> Has anyone else encountered these symptoms? For me the first clue
> was a negative "REMAINING-time" on MAUI and users who complained that
> they could not qdel a job. In the past I've achieved the same effect
> by removing the relevant foo.supe.JB and foo.supe.SC files from
> /var/spool/torque/server_priv/jobs on the master node.
> But I don't think that is the best way out. I'd appreciate any other
> debugging suggestions as well.
>
> --
> Rahul
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
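
As a debugging aid for the mom_log snippet quoted above, here is a small Python sketch that scans log lines for KILL_JOB requests the mom rejected with "job does not exist locally". It assumes each record sits on one line with semicolon-separated fields in the order seen in that snippet (time;code;daemon;type;id;message); the function name rejected_kill_requests is made up for the example.

```python
import re

MARKER = "job does not exist locally"


def rejected_kill_requests(lines):
    """Return (timestamp, job_id, sender) for each rejected KILL_JOB."""
    hits = []
    for line in lines:
        # Assumed record layout: time;code;daemon;type;id;message
        fields = line.strip().split(";", 5)
        if len(fields) < 6:
            continue
        timestamp, _code, _daemon, kind, job_id, message = fields
        if kind.strip() != "Job":
            continue
        if "KILL_JOB" not in message or MARKER not in message:
            continue
        sender = re.search(r"from (\S+)", message)
        hits.append((timestamp, job_id, sender.group(1) if sender else None))
    return hits


# The three records from the snippet in this thread, one per line.
SAMPLE = [
    "12/11/2008 11:47:38;0002;   pbs_mom;Svr;im_request;"
    "connect from 11.0.1.79:1023",
    "12/11/2008 11:47:38;0008;   pbs_mom;Job;233139.supernova.che.wisc.edu;"
    "received request 'KILL_JOB' from 11.0.1.79:1023",
    "12/11/2008 11:47:38;0008;   pbs_mom;Job;233139.supernova.che.wisc.edu;"
    "ERROR:    received request 'KILL_JOB' from 11.0.1.79:1023 for job "
    "'233139.supernova.che.wisc.edu' (job does not exist locally)",
]

for hit in rejected_kill_requests(SAMPLE):
    print(hit)
```

Run against a real mom_log (for example by passing an open file object), this would list which jobs and which sending nodes are involved, which may help decide which pbs_mom to look at before restarting anything.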


