[torqueusers] pbs_mom request, was Re: PBS_MOM kills running jobs when restarted
Craig West
cwest at astro.umass.edu
Tue Dec 15 07:12:15 MST 2009
I use the same method as Chris. I manually start pbs_mom on each node
after boot.
A node marked "offline" is one I have taken out of service myself, and a
node marked "down" is one that has rebooted or has a health problem.
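
For illustration, the offline/down distinction could be pulled out of the node listing roughly as follows. This is only a sketch: it assumes the text comes from something like `pbsnodes -l`, with one node per line as a name followed by a comma-separated state, and the function name and sample node names are made up for the example.

```python
def split_by_state(listing):
    """Split a node listing into (offline, down) name lists.

    Assumes each line looks like "<node> <state>[,<state>...]";
    this layout is an assumption, so check it against your own output.
    """
    offline, down = [], []
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        name, states = parts[0], parts[1].split(",")
        if "offline" in states:
            # Taken out by the admin; counts as offline even if also down.
            offline.append(name)
        elif "down" in states:
            # Rebooted or unhealthy; needs a look before going back in.
            down.append(name)
    return offline, down


if __name__ == "__main__":
    sample = "node01 offline\nnode02 down\nnode03 down,offline\n"
    print(split_by_state(sample))
```

Nodes that show up only as "down" are the ones worth inspecting before pbs_mom is started on them again.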
I too have had trouble in the past with rebooting nodes disrupting the
queue.
I have scripts that allow me to start up pbs_mom on all the nodes from
the command line.
It wasn't worth the trouble for me to get the scripts to be robust
enough to deal with all the issues.
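
A minimal sketch of what such a start-up script might look like follows. It is not the script referred to above. It assumes passwordless ssh to every node and that pbs_mom lives at /usr/local/sbin/pbs_mom; the node names and function names are hypothetical. Like the scripts described, it makes no attempt to be robust: a failure is only reported per node.

```python
import subprocess


def build_start_command(node, mom_path="/usr/local/sbin/pbs_mom"):
    """Return the ssh argv that would start pbs_mom on one node.

    mom_path is an assumed install location; adjust for your site.
    """
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", node, mom_path]


def start_moms(nodes, dry_run=True):
    """Start pbs_mom on each node and return {node: succeeded}.

    With dry_run=True the commands are printed, not executed.
    """
    results = {}
    for node in nodes:
        cmd = build_start_command(node)
        if dry_run:
            print(" ".join(cmd))
            results[node] = True
            continue
        try:
            proc = subprocess.run(cmd, timeout=30)
            results[node] = proc.returncode == 0
        except (subprocess.TimeoutExpired, OSError):
            # Unreachable node or missing ssh binary: report as failed.
            results[node] = False
    return results


if __name__ == "__main__":
    # Hypothetical node names; dry run only prints the commands.
    start_moms(["node01", "node02"])
```

Running it with dry_run=True first shows exactly which commands would be sent before anything is started.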
If a machine reboots at 2am, I'll put it back in the queue when I get
to work, after I've taken a look to see why it rebooted.
Craig.