[torquedev] TORQUE and fault tolerance/resiliency

Christopher Samuel samuel at unimelb.edu.au
Tue Mar 8 21:11:16 MST 2011


Hi Paul,

On 05/03/11 06:05, Paul H. Hargrove wrote:

> Initially I am interested in knowing what JS and RM components could IN 
> THEORY do if/when connected to the FTB.  While it would be great if this 
> could evolve into collaborations to add to Torque, Maui, Moab, etc., but 
> I am not looking for any development commitment, just ideas.


> So, there are two questions I am seeking responses to:
> 1) What "events" could the JS and/or RM publish to help other components?
> The example that came first to my mind was generating an event if the 
> spool or log file systems are full.

1) Log file full
2) Spool full
3) Cannot talk to other pbs_moms
4) Torque health check script failure
5) Job start failure (unrelated to the user's script)
6) Failure to talk to the pbs_server
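
As a rough illustration of how (1) and (2) might be detected before being handed to the FTB, here is a minimal Python sketch. It only decides which events to raise; actually publishing them is left out. The event names, the 95% threshold and the function name are all made up for the example and are not part of Torque or the FTB.

```python
import shutil
from typing import NamedTuple


class Usage(NamedTuple):
    """Same shape as the tuple returned by shutil.disk_usage()."""
    total: int
    used: int
    free: int


def full_filesystem_events(paths, threshold=0.95, usage=shutil.disk_usage):
    """Return (event_name, path) pairs for every watched path that is nearly full.

    `paths` maps a hypothetical event name to the directory to check.
    `usage` is injectable so the check can be exercised without real disks.
    """
    events = []
    for event_name, path in paths.items():
        u = usage(path)
        # Guard against pseudo-filesystems that report a total of zero.
        if u.total > 0 and u.used / u.total >= threshold:
            events.append((event_name, path))
    return events
```

A pbs_mom or pbs_server could call something like this periodically with its own spool and log directories and publish whatever comes back.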

> 2) What "events" could the JS and/or RM subscribe to in order to "behave 
> better" in the presence of faults?

Well, there's an awful lot, but here are some off the cuff:

1) user out of disk space
2) global filesystem down
3) license server down (for jobs that are requesting licenses)
4) hardware faults (MCEs, SMART, fans, etc.)
5) temperature alerts (maybe room based)
6) UPS messages

> The example that came first to my mind was information about anything 
> that was "down" that might be expressed as a job requirement - such as 
> failed license servers, full global filesystem(s), downed nodes, etc..

Well, downed nodes should be noticed by Torque, and a failure of
a global filesystem would presumably affect all users, so it should
stop any new jobs from starting.

> The response would be to not start any job that required the failed 
> component(s).
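
To make that response concrete, here is a small Python sketch of the decision a scheduler might take once it has subscribed to "down" events. It assumes each job carries a set of required component names and that a global filesystem failure blocks everything, as discussed above. The component naming ("global_fs", "license:...") and the function itself are hypothetical, not anything Torque, Maui or Moab provide.

```python
def runnable_jobs(jobs, down_components):
    """Split job ids into (startable, held) given the components reported down.

    `jobs` maps a job id to the set of component names that job requires.
    A down "global_fs" is treated as affecting every job.
    """
    down = set(down_components)
    startable, held = [], []
    for job_id, requires in jobs.items():
        # Hold the job if everything is blocked or it needs a failed component.
        if "global_fs" in down or down & set(requires):
            held.append(job_id)
        else:
            startable.append(job_id)
    return startable, held
```

For example, with a license server reported down, only the jobs that requested that license would be held, and the rest would still be eligible to start.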


> I would appreciate any thoughts/feedback/questions you may have based on 
> the 2 high-level questions above.

Hope this is useful!

-- 
    Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545


