[torqueusers] Can i control if the jobs dies or not??
Garrick Staples
garrick at usc.edu
Wed Aug 10 19:13:32 MDT 2005
On Wed, Aug 10, 2005 at 08:23:24AM -0300, Leandro alleged:
> Hi,
>
> I have a very clever application that can dynamically distribute the load
> across the nodes allocated to run a job. If one node dies in the middle of
> the computation, the application can carry on with the other nodes, and
> another process can pick up the unfinished work of the dead node and
> complete it.
>
> This application is written in Fortran and we are using MPICH. The
> application doesn't need to communicate and the processes don't share
> data, so the processes are very independent.
>
> We use mpiexec to start the processes in Torque, and I can remove the "-kill"
> parameter so that the processes on the nodes keep going, but the default
> behavior of PBS/Torque is to kill the job when a node dies. Can I change this
> behavior? If there's no way to do that with some kind of configuration, can
> someone point me to the place in the code where I can work on this?
This falls under a general class of "high availability" features that CRI
really wants to get working. CRI wants Moab to rerun/requeue jobs on node
failures, and wants multi-server support in MOM to use for pbs_server failover.
We've been doing a TON of work with tightening up the communications between
pbs_server and pbs_mom. My immediate goals have been to make sure they stay in
sync and can resync as necessary. The HA goal is there, but I don't know how
close we'll get in the next release.
At this point in time, the MOM on the job's primary execution node (Mother
Superior, MS) will always kill the job if a sister MOM isn't replying.
MS sends IM_POLL_JOB messages to the sisters. When a sister isn't replying, MS
closes the connection with mom_comm.c:im_eof(), which calls
mom_comm.c:node_bailout(). If an IM_POLL_JOB message is still outstanding,
node_bailout() sets "pjob->ji_nodekill = np->hn_node;", and
mom_main.c:job_over_limit() then kills the job if "pjob->ji_nodekill !=
TM_ERROR_NODE".
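To make that chain easier to follow, here is a stripped-down model of the two decision points. It is only a sketch, not the TORQUE source: the sketch_* names and types are hypothetical stand-ins, SKETCH_ERROR_NODE stands in for TM_ERROR_NODE (its value here is an assumption), and the real functions do far more than this.

```c
/* Assumption: stand-in for the TM_ERROR_NODE sentinel ("no node has
 * requested a kill"). The real definition lives in the TORQUE headers. */
#define SKETCH_ERROR_NODE (-1)

/* Hypothetical, minimal stand-ins for the MOM's host-node and job records.
 * Only the two fields named in the message are modeled. */
struct sketch_node { int hn_node; };
struct sketch_job  { int ji_nodekill; };

/* Hypothetical tag for the kind of request still outstanding to a sister. */
enum sketch_im_cmd { SKETCH_IM_POLL_JOB, SKETCH_IM_OTHER };

/* A fresh job starts with no node marked as the reason for a kill. */
void sketch_job_init(struct sketch_job *pjob)
{
    pjob->ji_nodekill = SKETCH_ERROR_NODE;
}

/* Models the relevant branch of node_bailout(): if a poll was outstanding
 * when the sister's connection was closed, record which node failed. */
void sketch_node_bailout(struct sketch_job *pjob,
                         const struct sketch_node *np,
                         enum sketch_im_cmd outstanding)
{
    if (outstanding == SKETCH_IM_POLL_JOB)
        pjob->ji_nodekill = np->hn_node;
}

/* Models the relevant check in job_over_limit(): returns nonzero when the
 * job should be killed because some node has been recorded. */
int sketch_job_over_limit(const struct sketch_job *pjob)
{
    return pjob->ji_nodekill != SKETCH_ERROR_NODE;
}
```

In this model, a job initialized with sketch_job_init() survives until sketch_node_bailout() is called with SKETCH_IM_POLL_JOB, after which sketch_job_over_limit() reports that it should die. Anyone wanting the job to outlive a lost sister would presumably have to change the real code at one of these two points.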
--
Garrick Staples, Linux/HPCC Administrator
University of Southern California