[torqueusers] qdel will not delete

Steve Young chemadm at hamilton.edu
Thu Dec 11 12:02:55 MST 2008


When this happens, qdel -p <job id> will usually remove the job from
the queue even if a normal qdel won't. From the qdel man page:

        -p      Forcibly purge the job from the server.  This should only
                be used if a running job will not exit because its
                allocated nodes are unreachable.  The admin should make
                every attempt at resolving the problem on the nodes.  If a
                job’s mother superior recovers after purging the job, any
                epilogue scripts may still run.  This option is only
                available to a batch operator or the batch administrator.
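
If you want to script that escalation, here is a rough sketch in Python,
not a tested tool. It tries a normal qdel first and only falls back to
qdel -p if the job is still listed afterwards. The function name
force_delete, the wait_seconds parameter and the injectable run argument
are my own inventions for illustration. It also assumes that qstat <job id>
exits non-zero once the server no longer knows the job; if your server keeps
completed jobs listed for a while, you would need to check the job state
instead.

```python
import subprocess
import time


def force_delete(job_id, run=subprocess.run, wait_seconds=30):
    """Try a normal qdel; purge with qdel -p only if the job is still listed.

    force_delete, run and wait_seconds are hypothetical names for this sketch.
    """
    # Normal delete first; a non-zero exit is tolerated here.
    run(["qdel", job_id], check=False)

    # Give the server and the moms some time to clean up.
    time.sleep(wait_seconds)

    # Assumption: qstat exits non-zero when the server no longer knows the job.
    status = run(["qstat", job_id], check=False, capture_output=True)
    if status.returncode != 0:
        return "deleted"

    # Last resort, operator/administrator only (see the man page text above).
    run(["qdel", "-p", job_id], check=True)
    return "purged"
```

You would call it as force_delete("233139.supernova.che.wisc.edu") from an
account with operator or administrator rights. The purge step is kept as
the last resort because, as the man page says, the problem on the nodes
should be looked at first.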

Hope this helps,

-Steve

On Dec 11, 2008, at 1:47 PM, Rahul Nabar wrote:

> Every so often I have jobs that won't respond to qdel. Their
> "REMAINING-time" in MAUI then becomes negative, which was initially
> confusing since I thought it was a MAUI bug.
>
> But the root cause seems to be that PBS will not obey the qdel for this
> job, irrespective of whether I issue it as root or MAUI issues it.
>
> I had one such job today and debugged it further. All the sub-nodes
> seemed to be up, and the mom daemon on each of these nodes seemed to
> be up and running.
>
> The mom_log on the master node was interesting, though. It had this
> snippet:
>
> 12/11/2008 11:47:38;0002;   pbs_mom;Svr;im_request;connect from 11.0.1.79:1023
> 12/11/2008 11:47:38;0008;   pbs_mom;Job;233139.supernova.che.wisc.edu;received request 'KILL_JOB' from 11.0.1.79:1023
> 12/11/2008 11:47:38;0008;   pbs_mom;Job;233139.supernova.che.wisc.edu;ERROR:    received request 'KILL_JOB' from 11.0.1.79:1023 for job '233139.supernova.che.wisc.edu' (job does not exist locally)
>
> The only way I could get this job to delete was to restart the pbs_mom
> on that node.
>
> Has anyone else encountered these symptoms? For me the first clue
> was a negative "REMAINING-time" in MAUI and users who complained that
> they could not qdel a job. In the past I've achieved the same effect
> by removing the relevant foo.supe.JB and foo.supe.SC files from
> /var/spool/torque/server_priv/jobs on the master node.
> But I don't think that is the best way out. I'd appreciate any other
> debugging suggestions as well.
>
> -- 
> Rahul


