[torqueusers] pbs_mom request, was Re: PBS_MOM kills running jobs when restarted

Garrick Staples garrick at usc.edu
Mon Dec 14 11:37:07 MST 2009


On Mon, Dec 14, 2009 at 12:23:14PM +0100, Bogdan Costescu alleged:
> > Thirdly, if a node does go bad and reboot then it
> > makes diagnosis and troubleshooting a lot easier if
> > the node has no jobs on it.
> 
> If the node is offline-d upon unexpected reboot, it would still remain
> empty and ready for testing.

That's what I do.  The init script leaves a /.autopbserror file around while
the node is up (just like the autofsck mechanism), and a clean shutdown
removes it.

If the file is found on boot, the node was rebooted ungracefully and is
marked offline.

I like making sure that pbs_mom is at least _started_ on boot to allow old jobs
the chance to exit.
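
For anyone who wants to try the same thing, here is a rough sketch of the
sentinel-file logic. It is not my actual init script. Only the /.autopbserror
path comes from the description above. The function names, the use of Python
rather than shell, and the assumption that "pbsnodes -o" can be run from the
node itself to mark it offline are all illustrative. Start pbs_mom after
on_boot() either way, so old jobs still get the chance to exit.

```python
import os
import socket
import subprocess

# Sentinel path named in the message; everything else here is hypothetical.
SENTINEL = "/.autopbserror"


def on_boot(sentinel=SENTINEL, node=None, run=subprocess.call):
    """Run at boot, before starting pbs_mom.

    Returns True if the sentinel survived from the previous boot,
    meaning the node went down without a clean shutdown.
    """
    node = node or socket.gethostname()
    unclean = os.path.exists(sentinel)
    if unclean:
        # Assumption: the node is allowed to talk to pbs_server and
        # "pbsnodes -o <node>" is how it gets marked offline.
        run(["pbsnodes", "-o", node])
    # (Re)create the sentinel; only a clean shutdown removes it.
    with open(sentinel, "w"):
        pass
    return unclean


def on_clean_shutdown(sentinel=SENTINEL):
    """Run from the init script's stop action on a graceful shutdown."""
    if os.path.exists(sentinel):
        os.remove(sentinel)
```

The "run" parameter exists only so the logic can be exercised without a real
pbs_server; in an init script you would call the command directly.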

-- 
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California

Life is Good!