[torqueusers] Moving jobs off dead nodes?
Chris Samuel
csamuel at vpac.org
Thu Mar 17 17:02:49 MST 2005
On Fri, 18 Mar 2005 09:38 am, Troy Baer wrote:
> However, if one of the nodes is dead or not responding, this doesn't work.
We developed a hack around this for cases where a node has died completely
with a job on it due to a hardware failure and we can't get it back until
the service guys come with a replacement widget.
Unfortunately it's not a PBS hack: we alias the IP address of the dead
node to another node. That node then answers the plaintive queries from the
pbs_server and tells it that (surprise, surprise) it's never heard of the jobs
that it wants to know about, and the pbs_server goes away satisfied that the
job state can be changed.
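
For anyone wanting to try the aliasing trick, here is a rough sketch of the
commands involved, wrapped in a bit of Python so the address gets checked
first. Everything specific here is an assumption for illustration: the
address 192.168.1.42, the interface name eth0, the /32 prefix and the helper
name alias_commands are all made up, and it assumes a Linux stand-in node
with the iproute2 "ip" tool that is already running pbs_mom. It only builds
the command lines; run them as root yourself once you're happy with them.

```python
import ipaddress


def alias_commands(dead_node_ip, interface, prefix_len=32):
    """Build the commands to add and later remove an IP alias.

    dead_node_ip: address of the dead node (example value, not from the post)
    interface:    network interface on the stand-in node (assumed name)
    Returns a pair of argv lists: (add_command, remove_command).
    """
    # Raises ValueError if the string is not a valid IPv4/IPv6 address.
    addr = ipaddress.ip_address(dead_node_ip)
    max_len = 32 if addr.version == 4 else 128
    if not 0 < prefix_len <= max_len:
        raise ValueError("prefix length out of range for this address family")

    cidr = f"{addr}/{prefix_len}"
    add_command = ["ip", "addr", "add", cidr, "dev", interface]
    remove_command = ["ip", "addr", "del", cidr, "dev", interface]
    return add_command, remove_command


if __name__ == "__main__":
    add_cmd, remove_cmd = alias_commands("192.168.1.42", "eth0")
    print("to take over the dead node's address:", " ".join(add_cmd))
    print("to undo it once the node is repaired: ", " ".join(remove_cmd))
```

Remember to remove the alias before the repaired node comes back on the
network, otherwise two machines will be claiming the same address.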
The only way I can think of off the top of my head to handle this within PBS
itself is to add a new node state that can only be set and unset by the
administrator, maybe "unrecoverable" (or perhaps something easier to type),
that would tell PBS not to poll the node but to treat it as if any jobs there
had exited (presumably with a non-zero exit code).
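
To make the proposed behaviour concrete, here is a toy model of it in
Python. This is not TORQUE code and none of these names exist in PBS:
NodeState, Node, Job, mark_unrecoverable, clear_unrecoverable, should_poll
and the exit code value of 1 are all hypothetical, chosen only to show the
rules (admin-only set and unset, no polling, jobs treated as exited with a
non-zero code).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

# Hypothetical exit code recorded for jobs lost with the node; any
# non-zero value would fit the proposal.
ASSUMED_EXIT_CODE = 1


class NodeState(Enum):
    FREE = "free"
    DOWN = "down"
    UNRECOVERABLE = "unrecoverable"  # the proposed new state


@dataclass
class Job:
    job_id: str
    exit_code: Optional[int] = None  # None while the job is still running


@dataclass
class Node:
    name: str
    state: NodeState = NodeState.FREE
    jobs: List[Job] = field(default_factory=list)


def mark_unrecoverable(node: Node, is_admin: bool) -> List[Job]:
    """Admin-only: flag the node and treat its running jobs as exited."""
    if not is_admin:
        raise PermissionError("only the administrator may set this state")
    node.state = NodeState.UNRECOVERABLE
    exited = node.jobs
    for job in exited:
        if job.exit_code is None:
            job.exit_code = ASSUMED_EXIT_CODE
    node.jobs = []  # nothing is considered to be running there any more
    return exited


def clear_unrecoverable(node: Node, is_admin: bool) -> None:
    """Admin-only: return the node to 'down' so normal polling resumes."""
    if not is_admin:
        raise PermissionError("only the administrator may unset this state")
    if node.state is NodeState.UNRECOVERABLE:
        node.state = NodeState.DOWN


def should_poll(node: Node) -> bool:
    """The server would skip polling nodes flagged as unrecoverable."""
    return node.state is not NodeState.UNRECOVERABLE
```

The point of returning the exited jobs is that the server could then run
its usual job-completion handling on them, exactly as if the node had
reported them itself.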
I guess the issue then is that pbs_sched, maui and moab would need to be
modified to accommodate this new node status.
cheers!
Chris
--
Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia