[torqueusers] qdel will not delete
Steve Young
chemadm at hamilton.edu
Thu Dec 11 12:02:55 MST 2008
Usually when this happens, qdel -p <job id> will remove the job from
the queue if a normal qdel won't. From the qdel man page:
    -p  Forcibly purge the job from the server. This should only be
        used if a running job will not exit because its allocated
        nodes are unreachable. The admin should make every attempt at
        resolving the problem on the nodes. If a job’s mother superior
        recovers after purging the job, any epilogue scripts may still
        run. This option is only available to a batch operator or the
        batch administrator.
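
In case it helps, here is a rough sketch of that escalation in Python: try a normal qdel first, wait, and only fall back to qdel -p if the server still lists the job. The function names, the 30-second wait, and the use of qstat's exit status to decide whether the job is still known are my own assumptions, not anything from the man page, so adjust to your site.

```python
import subprocess
import time


def run(cmd):
    """Run a command and return its exit status (hypothetical helper)."""
    return subprocess.run(cmd, check=False).returncode


def job_in_queue(job_id, runner=run):
    # Assumption: qstat exits non-zero once the server no longer knows the job.
    return runner(["qstat", job_id]) == 0


def delete_job(job_id, wait_seconds=30, runner=run, sleep=time.sleep):
    """Try a normal qdel; purge with qdel -p only if the job is still listed."""
    runner(["qdel", job_id])
    sleep(wait_seconds)  # give the mom time to act on the request
    if not job_in_queue(job_id, runner):
        return "deleted"
    # Last resort: requires batch operator or administrator privileges.
    runner(["qdel", "-p", job_id])
    return "purged"
```

The runner and sleep arguments are only there so the logic can be exercised without a live server. If your server keeps completed jobs visible in qstat for a while, the exit-status check will be too pessimistic and you would want to look at the job state instead.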
Hope this helps,
-Steve
On Dec 11, 2008, at 1:47 PM, Rahul Nabar wrote:
> Every so often I have jobs that won't respond to qdel. Their
> "REMAINING-time" in MAUI then becomes negative, which was initially
> confusing since I thought it was a MAUI bug.
>
> But the root cause seems to be that PBS will not obey the qdel on this
> job, irrespective of whether I issue it as root or MAUI issues it.
>
> I had one such job today and debugged it further. All the sub-nodes
> seemed to be up, and the mom daemon on each of these nodes seemed to
> be up and running.
>
> The mom_log on the master node was interesting, though. It had this
> snippet:
>
> 12/11/2008 11:47:38;0002; pbs_mom;Svr;im_request;connect from 11.0.1.79:1023
> 12/11/2008 11:47:38;0008; pbs_mom;Job;233139.supernova.che.wisc.edu;received request 'KILL_JOB' from 11.0.1.79:1023
> 12/11/2008 11:47:38;0008; pbs_mom;Job;233139.supernova.che.wisc.edu;ERROR: received request 'KILL_JOB' from 11.0.1.79:1023 for job '233139.supernova.che.wisc.edu' (job does not exist locally)
>
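
To spot this condition without reading the log by eye, a short Python sketch like the one below can pull out every KILL_JOB request that the mom rejected with "job does not exist locally". The regular expression is built only from the error text in the snippet; the function name is made up for illustration, and I am assuming other Torque versions word the message the same way.

```python
import re

# Pattern taken from the error message in the mom_log snippet.
STALE_KILL = re.compile(
    r"ERROR: received request 'KILL_JOB' from (?P<peer>\S+) "
    r"for job '(?P<job>[^']+)' \(job does not exist locally\)"
)


def stale_kill_requests(log_text):
    """Return (job_id, peer) pairs for KILL_JOB requests the mom could not match."""
    # Collapse all whitespace so messages that were wrapped across lines still match.
    flat = " ".join(log_text.split())
    return [(m.group("job"), m.group("peer")) for m in STALE_KILL.finditer(flat)]


if __name__ == "__main__":
    import sys

    # Usage: python stale_kills.py /path/to/mom_log_file
    with open(sys.argv[1]) as handle:
        for job, peer in stale_kill_requests(handle.read()):
            print(job, peer)
```

A job that shows up here while the server still lists it as running is a candidate for the pbs_mom restart or the qdel -p route.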
> The only way I could get this job to delete was to restart the pbs_mom
> on that node.
>
> Has anyone else encountered these symptoms? For me the first clues
> were a negative "REMAINING-time" on MAUI and users complaining that
> they could not qdel a job. In the past I've achieved the same effect
> by removing the relevant foo.supe.JB and foo.supe.SC files from
> /var/spool/torque/server_priv/jobs on the master node.
> But I don't think that is the best way out. I'd appreciate any other
> debugging suggestions as well.
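
Before touching anything in that directory, it may be worth just listing which files belong to the stuck job. Here is a minimal, read-only Python sketch. It assumes the file names start with the numeric job id followed by a dot, as in the foo.supe.JB and foo.supe.SC pattern mentioned above; the function name is hypothetical and it deletes nothing.

```python
from pathlib import Path


def job_files(job_number, jobs_dir="/var/spool/torque/server_priv/jobs"):
    """List the .JB and .SC files whose names start with the given job number."""
    prefix = f"{job_number}."
    return sorted(
        p.name
        for p in Path(jobs_dir).iterdir()
        # Assumption: only the job file (.JB) and script (.SC) are of interest.
        if p.name.startswith(prefix) and p.suffix in (".JB", ".SC")
    )
```

Reading server_priv normally requires root, so run it on the master node with appropriate privileges.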
>
> --
> Rahul