[torqueusers] Can i control if the jobs dies or not??

Garrick Staples garrick at usc.edu
Wed Aug 10 19:13:32 MDT 2005


On Wed, Aug 10, 2005 at 08:23:24AM -0300, Leandro alleged:
> Hi,
> 
> I have a very clever application that can dynamically distribute the load 
> across the nodes allocated to run a job. If one node dies in the middle of 
> the computation, the application can carry on with the other nodes, and 
> another process can pick up the dead node's unfinished work and complete 
> it.
> 
> This application is written in Fortran and we are using MPICH. The 
> application doesn't need to communicate and the processes don't share 
> data, so the processes are very independent.
> 
> We use mpiexec to start the processes in Torque, and I can remove the 
> "-kill" parameter so that the processes on the nodes keep going, but the 
> default behavior of PBS/Torque is to kill the job when a node dies. Can I 
> change this behavior? If there's no way to do that with some kind of 
> configuration, can someone point me to where in the code I can work on this?

This falls under a general class of "high availability" features that CRI
really wants to get working.  CRI wants Moab to rerun/requeue jobs on node
failures, and wants multi-server support in MOM to use for pbs_server failover.

We've been doing a TON of work on tightening up the communications between
pbs_server and pbs_mom.  My immediate goals have been to make sure they stay in
sync and can resync as necessary.  The HA goal is there, but I don't know how
close we'll get in the next release.


At this point in time, the MOM on the execution node (MS, the Mother Superior)
will always kill the job if a sister MOM isn't replying.

MS sends IM_POLL_JOB messages to sisters.  When a sister isn't replying, MS
closes the connection with mom_comm.c:im_eof() which calls
mom_comm.c:node_bailout().  With outstanding IM_POLL_JOB messages,
node_bailout() sets "pjob->ji_nodekill = np->hn_node;" and
mom_main.c:job_over_limit() kills the job if "pjob->ji_nodekill !=
TM_ERROR_NODE".
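
To make that chain easier to follow, here is a minimal, self-contained sketch
in C.  It is not the Torque source: the struct layouts, the sketch_* function
names, the outstanding_polls counter and the value of SKETCH_ERROR_NODE are all
assumptions made up for illustration.  Only the field names ji_nodekill and
hn_node and the overall flow come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for Torque's TM_ERROR_NODE sentinel; the value is an assumption. */
#define SKETCH_ERROR_NODE (-1)

/* Hypothetical, heavily reduced versions of the node and job structures. */
struct sketch_node {
    int hn_node;            /* id of the sister node */
};

struct sketch_job {
    int ji_nodekill;        /* id of the node that caused the kill, or the sentinel */
    int outstanding_polls;  /* hypothetical count of unanswered IM_POLL_JOB messages */
};

/* A fresh job has no failed node recorded and no polls in flight. */
void sketch_job_init(struct sketch_job *pjob)
{
    assert(pjob != NULL);
    pjob->ji_nodekill = SKETCH_ERROR_NODE;
    pjob->outstanding_polls = 0;
}

/* Models node_bailout(): called after the connection to a sister is closed.
 * If a poll was still unanswered, record which node failed. */
void sketch_node_bailout(struct sketch_job *pjob, const struct sketch_node *np)
{
    assert(pjob != NULL && np != NULL);
    if (pjob->outstanding_polls > 0)
        pjob->ji_nodekill = np->hn_node;
}

/* Models the relevant check in job_over_limit(): the job is killed as soon
 * as any node has been recorded in ji_nodekill. */
bool sketch_job_over_limit(const struct sketch_job *pjob)
{
    assert(pjob != NULL);
    return pjob->ji_nodekill != SKETCH_ERROR_NODE;
}
```

In this sketch the kill decision depends only on whether ji_nodekill still
holds the sentinel, so changing the behavior would mean changing either where
that field is set or how that check treats it.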

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California