[torqueusers] qdel will not delete
Greenseid, Joseph M.
Joseph.Greenseid at ngc.com
Thu Dec 11 12:19:26 MST 2008
I've only seen this problem when some of the nodes allocated to the job are unresponsive (either because they've crashed or because they're so overloaded that they're functionally crippled). Once the mom can communicate with the unresponsive node again, the job will be able to exit (unless you force it, as Steve mentions below).
--Joe
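
A quick way to check for this condition is to ask the server which nodes it currently considers unavailable. The sketch below is only an illustration: it assumes that `pbsnodes -l` prints one node per line as a name followed by its state, and the function names are made up for this example. A node that is merely overloaded may still be reported as up, so an empty result does not rule the problem out.

```python
import subprocess


def parse_node_listing(listing):
    """Map node name -> state from text shaped like `pbsnodes -l` output.

    Assumed format (not taken from this thread): "<node name> <state>".
    """
    nodes = {}
    for line in listing.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            nodes[parts[0]] = parts[1].strip()
    return nodes


def list_unavailable_nodes():
    """Run `pbsnodes -l` and return the parsed result."""
    result = subprocess.run(
        ["pbsnodes", "-l"], capture_output=True, text=True, check=True
    )
    return parse_node_listing(result.stdout)
```

Comparing the returned node names against the job's allocated nodes shows whether any of them is the likely reason the job cannot exit.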
________________________________
From: torqueusers-bounces at supercluster.org on behalf of Steve Young
Sent: Thu 12/11/2008 2:02 PM
To: Rahul Nabar
Cc: torqueusers at supercluster.org
Subject: Re: [torqueusers] qdel will not delete
Usually when this happens qdel -p <job id> will remove the job from
the queue if a normal qdel won't do it. From the qdel man page:

    -p  Forcibly purge the job from the server. This should only be
        used if a running job will not exit because its allocated
        nodes are unreachable. The admin should make every attempt at
        resolving the problem on the nodes. If a job's mother superior
        recovers after purging the job, any epilogue scripts may still
        run. This option is only available to a batch operator or the
        batch administrator.

Hope this helps,
-Steve
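
The order of operations described above (normal qdel first, purge only as a last resort) could be scripted roughly as follows. This is a sketch under assumptions, not a tested tool: the function name, the 30-second wait, and the `run`/`sleep` parameters are hypothetical, and it assumes `qstat <job id>` exits non-zero once the server no longer knows the job. If the server keeps finished jobs visible for a while, the check would need to look at the job state instead.

```python
import subprocess
import time


def delete_job(job_id, allow_purge=False, wait_seconds=30,
               run=subprocess.run, sleep=time.sleep):
    """Request a normal qdel, then purge only if the job is still listed.

    Returns "deleted", "still_queued", or "purged".
    """
    run(["qdel", job_id])
    sleep(wait_seconds)  # give the moms time to report the job's exit

    # Assumption: qstat exits non-zero when the server no longer knows the job.
    if run(["qstat", job_id], capture_output=True).returncode != 0:
        return "deleted"
    if not allow_purge:
        return "still_queued"

    # Per the man page, -p needs batch operator or administrator rights.
    run(["qdel", "-p", job_id], check=True)
    return "purged"
```

Keeping `allow_purge` off by default reflects the man page's advice to try resolving the problem on the nodes before purging.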
On Dec 11, 2008, at 1:47 PM, Rahul Nabar wrote:
> I've had jobs that won't respond to qdel once every so often. Their
> "REMAINING-time" on MAUI then becomes negative which was initially
> confusing since I thought it was a MAUI bug.
>
> But the root cause seems to be that PBS will not obey the qdel on this
> job, irrespective of whether I issue it as root or MAUI issues it.
>
> I had one such job today and debugged it further: all the sub-nodes
> seemed to be up, and the mom daemon on each of these nodes seemed to
> be up and running.
>
> The mom_log on the master node, though, was interesting; it had this
> snippet:
>
> 12/11/2008 11:47:38;0002; pbs_mom;Svr;im_request;connect from 11.0.1.79:1023
> 12/11/2008 11:47:38;0008; pbs_mom;Job;233139.supernova.che.wisc.edu;received request 'KILL_JOB' from 11.0.1.79:1023
> 12/11/2008 11:47:38;0008; pbs_mom;Job;233139.supernova.che.wisc.edu;ERROR: received request 'KILL_JOB' from 11.0.1.79:1023 for job '233139.supernova.che.wisc.edu' (job does not exist locally)
>
> The only way I could get this job to delete was to restart the pbs_mom
> on that node.
>
> Anyone else who has encountered these symptoms? For me the first clue
> was a negative "REMAINING-time" on MAUI and users who complained that
> they could not qdel a job. In the past I've achieved the same effect
> by removing the relevant foo.supe.JB and foo.supe.SC files from the
> /var/spool/torque/server_priv/jobs on the master node.
> But I don't think that is the best way out. I'd appreciate any other
> debug suggestions as well.
>
> --
> Rahul
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
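
The mom_log excerpt quoted above suggests one more debugging aid: scanning a mom log for requests that name a job the mom does not have. The following is a minimal sketch that assumes each log record sits on a single line in the semicolon-separated layout shown in the excerpt; the function and pattern names are hypothetical.

```python
import re

# Matches the error text from the excerpt, e.g.
# "received request 'KILL_JOB' from 11.0.1.79:1023 for job '...' (job does not exist locally)"
MISSING_JOB = re.compile(
    r"received request '(?P<request>\w+)' from (?P<peer>[\d.]+:\d+) "
    r"for job '(?P<job>[^']+)' \(job does not exist locally\)"
)


def find_missing_job_errors(log_text):
    """Return (timestamp, job id, peer) for each 'job does not exist locally' record."""
    hits = []
    for line in log_text.splitlines():
        match = MISSING_JOB.search(line)
        if match:
            # The timestamp is the first semicolon-separated field.
            timestamp = line.split(";", 1)[0].strip()
            hits.append((timestamp, match.group("job"), match.group("peer")))
    return hits
```

Running this over the mom logs of each allocated node would show which mom has lost track of the job, which is the node whose pbs_mom had to be restarted in the case described here.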