[torqueusers] Moving jobs off dead nodes?
David Singleton
David.Singleton at anu.edu.au
Thu Mar 17 17:46:35 MST 2005
I have an incomplete hack of req_deletejob() in the server that uses
that wonderful char *extend "freebie" in the batch_request struct to
accept a "--force" type option from qdel. req_deletejob() then doesn't
try to talk to the MOM, it tries to jump straight into the middle of the
jobobit code. It came close to working (and occasionally did), but there
is still a bug in how I short-circuit the jobobit chatter between server
and MOM, and I haven't looked at it for a couple of years.
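
A minimal sketch of the dispatch decision described above, not the actual
hack. It assumes the request carries a free-form extend string that a
modified qdel fills with a force marker. The struct, the enum, the function
names and the literal "--force" string are all hypothetical stand-ins, and
the real batch_request and jobobit handling are not reproduced here.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the server's batch_request; only the
 * free-form extend string matters for this sketch. */
struct sketch_request {
    const char *extend;   /* optional extension string from the client, may be NULL */
};

/* Assumed marker; the string a modified qdel would really send is a guess. */
#define SKETCH_FORCE_OPT "--force"

/* The two ways a delete request could be handled. */
enum sketch_delete_path {
    SKETCH_ASK_MOM,       /* normal path: ask the MOM to kill the job */
    SKETCH_LOCAL_OBIT     /* forced path: skip the MOM, do end-of-job cleanup on the server */
};

/* Return 1 if the request carries the force marker, 0 otherwise. */
int sketch_delete_is_forced(const struct sketch_request *preq)
{
    if (preq == NULL || preq->extend == NULL)
        return 0;
    return strcmp(preq->extend, SKETCH_FORCE_OPT) == 0;
}

/* Pick the handling path for a delete request. */
enum sketch_delete_path sketch_choose_delete_path(const struct sketch_request *preq)
{
    return sketch_delete_is_forced(preq) ? SKETCH_LOCAL_OBIT : SKETCH_ASK_MOM;
}
```

The point of keeping the check in one small predicate is that an ordinary
qdel, which sends no extend string, still takes the normal path through the
MOM; only an explicit force request bypasses it.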
Chris, I like your dummy node idea - you never know what you might use
that 386 in the corner for, do you :-).
David
Chris Samuel wrote:
> On Fri, 18 Mar 2005 09:38 am, Troy Baer wrote:
>
>
>>However, if one of the nodes is dead or not responding, this doesn't work.
>
>
> We developed a hack around this for when we've had a node die completely with
> a job on it due to a hardware failure and not been able to get it back until
> the service guys come with a replacement widget.
>
> Unfortunately it's not a PBS hack: it's aliasing the IP address of the dead
> node to another node. This node then answers the plaintive queries from the
> pbs_server and tells it that (surprise, surprise) it's never heard of the jobs
> that it wants to know about, and the pbs_server goes away satisfied that the
> job state can be changed.
>
> The only way I can think of, off the top of my head, to do this within PBS
> is to include a new state for a node that can only be set and unset by the
> administrator, maybe "unrecoverable" (or perhaps something easier to type),
> that would tell PBS not to poll the node but to treat it as if any jobs
> there had exited (presumably with a non-zero exit code).
>
> I guess the issue then is that pbs_sched, maui and moab would need to be
> modified to accommodate this new node status.
>
> cheers!
> Chris
--
--------------------------------------------------------------------------
Dr David Singleton ANU Supercomputer Facility
HPC Systems Manager and APAC National Facility
David.Singleton at anu.edu.au Leonard Huxley Bldg (No. 56)
Phone: +61 2 6125 4389 Australian National University
Fax: +61 2 6125 8199 Canberra, ACT, 0200, Australia
--------------------------------------------------------------------------