[torqueusers] pbs_mom request, was Re: PBS_MOM kills running jobs when restarted

Craig West cwest at astro.umass.edu
Tue Dec 15 07:12:15 MST 2009


I use the same method as Chris. I manually start pbs_mom on each node 
after boot.
A node marked "offline" is one I have taken out of service myself, and a 
node marked "down" is one that has rebooted or has a health problem.
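
For illustration, the two states can be told apart from the output of 
`pbsnodes -l`, which lists nodes that are down, offline, or unknown. 
This is only a minimal sketch: it assumes the usual two-column output 
(node name, then a comma-separated state), and the sample text and node 
names are made up.

```python
def split_states(pbsnodes_l_output):
    """Group node names by state from `pbsnodes -l`-style text.

    Assumes each line is "<node> <state>[,<state>...]"; lines that do
    not match that shape are skipped.
    """
    by_state = {}
    for line in pbsnodes_l_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        node, states = fields[0], fields[1]
        for state in states.split(","):
            by_state.setdefault(state, []).append(node)
    return by_state


# Hypothetical sample, not real cluster output.
SAMPLE = """\
node03               down
node07               offline
node09               down,offline
"""
```

Calling `split_states(SAMPLE)` groups node03 and node09 under "down" and 
node07 and node09 under "offline", so a node that rebooted while it was 
deliberately offline shows up in both lists.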

I too have had trouble in the past with rebooting nodes disrupting 
the queue.
I have scripts that allow me to start up pbs_mom on all the nodes from 
the command line.
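
A minimal sketch of this kind of script, for illustration only. It 
assumes passwordless ssh to each node and pbs_mom installed at 
/usr/sbin/pbs_mom; the node names, the path, and the function names are 
hypothetical and are not taken from the scripts described above.

```python
import subprocess

# Hypothetical values: adjust to the local cluster.
PBS_MOM = "/usr/sbin/pbs_mom"   # assumed install path
NODES = ["node01", "node02"]    # assumed node names


def build_start_command(node, pbs_mom=PBS_MOM):
    """Return the argv that starts pbs_mom on one node over ssh."""
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", node, pbs_mom]


def start_moms(nodes, runner=subprocess.run):
    """Try to start pbs_mom on every node; return the nodes that failed."""
    failed = []
    for node in nodes:
        try:
            result = runner(build_start_command(node), timeout=30)
        except (OSError, subprocess.TimeoutExpired):
            # ssh missing, or the node did not answer in time.
            failed.append(node)
            continue
        if result.returncode != 0:
            failed.append(node)
    return failed
```

Running `start_moms(NODES)` from an interactive session returns the list 
of nodes where the start failed, which is a reasonable cue to go and 
look at those machines by hand rather than retry automatically.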

It wasn't worth the trouble for me to get the scripts to be robust 
enough to deal with all the issues.
If a machine reboots at 2am, I'll put it back in the queue when I get 
to work, after I've taken a look to see why it rebooted.


Craig.

