[torquedev] TORQUE and fault tolerance/resiliency

Paul H. Hargrove PHHargrove at lbl.gov
Fri Mar 4 12:05:55 MST 2011


Torque developers,

Some of you may know me as the lead developer of BLCR (Berkeley Lab 
Checkpoint/Restart).  However, that project is part of a larger effort 
known as CIFTS (Coordinated Infrastructure for Fault Tolerant Systems).  
I am writing this post as a representative of that effort.

One of the main "products" of CIFTS is known as "FTB" (Fault Tolerance 
Backplane), which is a publish-subscribe infrastructure that lets system 
components share "events" related to faults and error conditions.  
This information can then be used within the system to operate "better" 
in the presence of faults.

Components in the FTB are not limited to a pre-defined set, but our 
expectations include at least the Job Scheduler (JS), Resource Manager 
(RM), MPI implementation, Global File System implementation, Numerical 
and I/O libraries linked into an application, and potentially even the 
application itself.  The users and administrators of the system are 
potential "components" as well, via monitoring scripts they might 
write.  This post is an effort to tap into this list's knowledge of Job 
Schedulers and Resource Managers.

Initially I am interested in knowing what JS and RM components could IN 
THEORY do if/when connected to the FTB.  While it would be great if this 
could evolve into collaborations to add FTB support to Torque, Maui, 
Moab, etc., I am not looking for any development commitment, just ideas.

So, there are two questions I am seeking responses to:

1) What "events" could the JS and/or RM publish to help other components?
The example that came first to my mind was generating an event when the 
spool or log file systems fill up (a small sketch of this direction 
follows after question 2 below).

2) What "events" could the JS and/or RM subscribe to in order to "behave 
better" in the presence of faults?
The example that came first to my mind was information about anything 
"down" that could be expressed as a job requirement - such as failed 
license servers, full global filesystem(s), downed nodes, etc.  The 
response would be to not start any job that requires the failed 
component(s) (also sketched below).
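
To make question 1 a bit more concrete, here is a minimal C sketch of an 
RM-side check that publishes an event when its spool filesystem is nearly 
full.  To be clear, ftb_publish(), the event name, the 5% threshold, and 
the spool path are all hypothetical placeholders of my own invention, not 
the actual FTB client API or Torque internals; it only illustrates the 
"publish" direction.

/*
 * Question 1 sketch: publish an event when the spool filesystem is
 * nearly full.  ftb_publish() and the event name are hypothetical
 * placeholders, NOT the real FTB client API.
 */
#include <stdio.h>
#include <sys/statvfs.h>

/* Placeholder for handing an event to the backplane. */
static void ftb_publish(const char *event_name, const char *payload)
{
    printf("PUBLISH %s: %s\n", event_name, payload);
}

static void check_spool_space(const char *spool_path)
{
    struct statvfs vfs;

    if (statvfs(spool_path, &vfs) != 0 || vfs.f_blocks == 0)
        return;                          /* cannot stat: nothing to report */

    double free_frac = (double)vfs.f_bavail / (double)vfs.f_blocks;
    if (free_frac < 0.05)                /* threshold is arbitrary */
        ftb_publish("RM.SPOOL_FS_FULL", spool_path);
}

int main(void)
{
    check_spool_space("/var/spool/torque");  /* path is illustrative */
    return 0;
}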
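
And for question 2, a similarly hedged sketch of the "subscribe" 
direction: the RM registers a callback for "component down" events and 
holds any queued job whose requirements name the failed component.  
Again, ftb_subscribe(), the event names, and the toy job queue are 
illustrative stand-ins, not the real FTB API or pbs_server data 
structures.

/*
 * Question 2 sketch: subscribe to "component down" events and hold any
 * queued job that requires the failed component.  ftb_subscribe(), the
 * event names, and struct job are hypothetical placeholders.
 */
#include <stdio.h>
#include <string.h>

typedef void (*ftb_callback_t)(const char *event_name, const char *payload);

static ftb_callback_t registered_cb;     /* one subscriber, for brevity */

/* Placeholder registration with the backplane. */
static void ftb_subscribe(const char *event_mask, ftb_callback_t cb)
{
    (void)event_mask;                    /* pattern matching omitted */
    registered_cb = cb;
}

/* Toy queue: each job lists one required component. */
struct job { const char *id; const char *requires; int held; };
static struct job queue[] = {
    { "42.server", "licsrv1",  0 },
    { "43.server", "gscratch", 0 },
};

static void on_component_down(const char *event_name, const char *payload)
{
    for (size_t i = 0; i < sizeof(queue) / sizeof(queue[0]); i++) {
        if (strcmp(queue[i].requires, payload) == 0) {
            queue[i].held = 1;           /* do not start this job */
            printf("held %s: requires %s (%s)\n",
                   queue[i].id, payload, event_name);
        }
    }
}

int main(void)
{
    ftb_subscribe("*.COMPONENT_DOWN", on_component_down);

    /* Simulate a license-server failure arriving over the backplane. */
    registered_cb("LICSRV.COMPONENT_DOWN", "licsrv1");
    return 0;
}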

I would appreciate any thoughts/feedback/questions you may have based on 
the two high-level questions above.

Thanks,
-Paul

-- 
Paul H. Hargrove                          PHHargrove at lbl.gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900


