[torqueusers] Moving jobs off dead nodes?

Chris Samuel csamuel at vpac.org
Thu Mar 17 17:02:49 MST 2005


On Fri, 18 Mar 2005 09:38 am, Troy Baer wrote:

> However, if one of the nodes is dead or not responding, this doesn't work.

We developed a hack around this for the case where a node running a job dies 
completely from a hardware failure and we can't get it back until the service 
guys arrive with a replacement widget.

Unfortunately it's not a PBS hack: we alias the IP address of the dead node 
onto another node. That node then answers the plaintive queries from the 
pbs_server and tells it that (surprise, surprise) it's never heard of the jobs 
it's asking about, and the pbs_server goes away satisfied that the job state 
can be changed.
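
As a rough illustration of the aliasing step only, the sketch below builds 
(but does not run) an iproute2 "ip addr add" command that would put the dead 
node's address on a stand-in node's interface. The function name, the address 
192.0.2.17, the /24 prefix and the interface eth0 are all placeholders chosen 
for the example, not values from the original setup, and the post does not say 
which tool was actually used to create the alias.

```python
import ipaddress


def alias_command(dead_node_ip: str, prefix_len: int, interface: str) -> list[str]:
    """Build an iproute2 command that adds the dead node's address as an
    extra address on a stand-in node's interface. Nothing is executed."""
    # Validate the address so a typo fails here rather than on the network.
    addr = ipaddress.ip_address(dead_node_ip)
    return ["ip", "addr", "add", f"{addr}/{prefix_len}", "dev", interface]


if __name__ == "__main__":
    # Placeholder values; substitute the dead node's real address, the
    # subnet prefix length, and the stand-in node's interface name.
    print(" ".join(alias_command("192.0.2.17", 24, "eth0")))
```

The command would have to be run as root on the stand-in node, and only while 
the dead node really is off the network, since two hosts answering for the 
same address will conflict.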

The only way I can think of, off the top of my head, to handle this within 
PBS itself is to add a new node state that can only be set and unset by the 
administrator, maybe "unrecoverable" (or perhaps something easier to type). 
It would tell PBS not to poll the node, but to treat any jobs there as if 
they had exited (presumably with a non-zero exit code).
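
The proposed behaviour can be sketched as a toy model. This is not TORQUE 
code: NodeState, Node, reap_jobs, the poll callable and ASSUMED_EXIT_CODE are 
all hypothetical names invented for the illustration, and the real server's 
data structures will look nothing like this. It only shows the rule "never 
contact an unrecoverable node, report its jobs as exited with a non-zero 
code".

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeState(Enum):
    FREE = "free"
    DOWN = "down"
    UNRECOVERABLE = "unrecoverable"  # hypothetical state, set only by an admin


@dataclass
class Node:
    name: str
    state: NodeState = NodeState.FREE
    jobs: list = field(default_factory=list)


ASSUMED_EXIT_CODE = 1  # placeholder non-zero exit status for lost jobs


def reap_jobs(node, poll):
    """Return {job_id: exit_code} for jobs that can be marked finished.

    `poll` is a hypothetical callable standing in for the server asking a
    node about a job; it returns an exit code, or None if still running.
    """
    finished = {}
    for job_id in list(node.jobs):
        if node.state is NodeState.UNRECOVERABLE:
            # Do not contact the node at all; treat the job as exited.
            finished[job_id] = ASSUMED_EXIT_CODE
        else:
            code = poll(node, job_id)
            if code is not None:
                finished[job_id] = code
    # Drop the finished jobs from the node so they are not reported twice.
    for job_id in finished:
        node.jobs.remove(job_id)
    return finished
```

Keeping the state admin-only matters in this model: nothing in reap_jobs ever 
sets or clears UNRECOVERABLE, so a node that is merely slow to respond is 
never written off automatically.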

I guess the issue then is that pbs_sched, maui and moab would need to be 
modified to accommodate this new node status.

cheers!
Chris
-- 
 Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
