[torqueusers] Moving jobs off dead nodes?

David Singleton David.Singleton at anu.edu.au
Thu Mar 17 17:46:35 MST 2005


I have an incomplete hack of req_deletejob() in the server that uses
that wonderful char *extend "freebie" in the batch_request struct to
accept a "--force" type option from qdel.  With that option set,
req_deletejob() doesn't try to talk to the MOM; it tries to jump straight
into the middle of the jobobit code.  It's close to working (it did
occasionally) but I still have some bug in short-circuiting the jobobit
chatter between server and MOM, and I haven't looked at it for a couple
of years.
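
As a rough illustration of that control flow only (this is not the actual
patch, and it is a toy model in Python rather than the server's C), the
sketch below branches on the extend string.  Everything in it is an
assumption made for illustration: the "force" token, the Server class and
its methods are hypothetical, and only the name `extend` echoes the real
batch_request field.

```python
from dataclasses import dataclass, field
from typing import Optional

# Assumed spelling of the option carried in the extend string.
FORCE_TOKEN = "force"


@dataclass
class BatchRequest:
    """Toy stand-in for a delete request; `extend` mirrors char *extend."""
    job_id: str
    extend: Optional[str] = None


@dataclass
class Server:
    """Hypothetical server model: job_id -> state, plus an action log."""
    jobs: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def signal_mom(self, job_id):
        # Normal path: ask the MOM to kill the job, then wait for its obit.
        self.log.append(("signal_mom", job_id))
        self.jobs[job_id] = "EXITING"

    def local_obit(self, job_id):
        # Forced path: finish the job server-side without contacting the MOM.
        self.log.append(("local_obit", job_id))
        del self.jobs[job_id]


def wants_force(req):
    """True if the request's extend string carries the force token."""
    return req.extend is not None and req.extend.strip() == FORCE_TOKEN


def req_deletejob(server, req):
    """Dispatch a delete request down the normal or the forced path."""
    if req.job_id not in server.jobs:
        return "unknown job"
    if wants_force(req):
        server.local_obit(req.job_id)
        return "purged"
    server.signal_mom(req.job_id)
    return "waiting for MOM"
```

The forced branch never touches the MOM, which is the whole point when the
node is dead; the hard part described above (cleanly short-circuiting the
obit exchange) is exactly what this toy model glosses over.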

Chris, I like your dummy node idea - you never know what you might use
that 386 in the corner for, do you :-).

David

Chris Samuel wrote:
> On Fri, 18 Mar 2005 09:38 am, Troy Baer wrote:
> 
> 
>>However, if one of the nodes is dead or not responding, this doesn't work.
> 
> 
> We developed a hack around this for when we've had a node die completely with 
> a job on it due to a hardware failure and not been able to get it back until 
> the service guys come with a replacement widget.
> 
> Unfortunately it's not a PBS hack, it's aliasing the IP address of the dead 
> node to another node, this node then answers the plaintive queries from the 
> pbs_server and tells it that (surprise surprise) it's never heard of the jobs 
> that it wants to know about and the pbs_server goes away satisfied that the 
> job state can be changed.
> 
> The only PBS-side way I can think of off the top of my head is to include a new 
> state for a node that can only be set and unset by the administrator, maybe 
> "unrecoverable" (or perhaps something easier to type) that would tell PBS to 
> not poll the node but treat it as if any jobs there had exited (presumably 
> with a non-zero exit code).
> 
> I guess the issue then is that pbs_sched, maui and moab would need to be 
> modified to accommodate this new node status.
> 
> cheers!
> Chris
> 
> 
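
The administrator-only node state Chris proposes can be sketched as a toy
polling pass.  This is a model of the idea, not PBS or TORQUE code: the
Node class, set_node_state(), poll_nodes(), the query_mom callback and the
exit code of 1 are all hypothetical names and values chosen for
illustration.

```python
from dataclasses import dataclass, field

UNRECOVERABLE = "unrecoverable"  # the proposed admin-only node state
LOST_NODE_EXIT = 1               # assumed non-zero exit code for lost jobs


@dataclass
class Node:
    name: str
    state: str = "free"
    jobs: list = field(default_factory=list)


def set_node_state(node, state, is_admin):
    """Change a node's state; only an admin may set or clear UNRECOVERABLE."""
    if UNRECOVERABLE in (node.state, state) and not is_admin:
        raise PermissionError("only the administrator may change this state")
    node.state = state


def poll_nodes(nodes, query_mom):
    """Run one polling pass and return {job_id: exit_code} for finished jobs.

    query_mom(node) stands in for asking a live MOM which jobs have ended;
    it must return a dict of job_id -> exit_code.
    """
    finished = {}
    for node in nodes:
        if node.state == UNRECOVERABLE:
            # Do not poll: treat every job on the node as having exited.
            for job_id in node.jobs:
                finished[job_id] = LOST_NODE_EXIT
            node.jobs.clear()
            continue
        done = query_mom(node)
        node.jobs = [j for j in node.jobs if j not in done]
        finished.update(done)
    return finished
```

A scheduler would still have to learn not to place new work on a node in
this state, which is the change to pbs_sched, maui and moab mentioned in
the quoted message.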


-- 
--------------------------------------------------------------------------
    Dr David Singleton               ANU Supercomputer Facility
    HPC Systems Manager              and APAC National Facility
    David.Singleton at anu.edu.au       Leonard Huxley Bldg (No. 56)
    Phone: +61 2 6125 4389           Australian National University
    Fax:   +61 2 6125 8199           Canberra, ACT, 0200, Australia
--------------------------------------------------------------------------

