[torquedev] pbs_demux

Garrick Staples garrick at usc.edu
Mon Mar 6 12:19:58 MST 2006


On Mon, Mar 06, 2006 at 02:11:11PM -0500, Prakash Velayutham alleged:
> Garrick Staples wrote:
> >On Fri, Mar 03, 2006 at 10:29:53AM -0500, Prakash Velayutham alleged:
> >  
> >>Hi All,
> >>
> >>This question is about a multi-node MPI job. After the job is
> >>scheduled and sent over to the MS (Mother Superior), the MS first
> >>sends a JOIN_JOB request to all the other nodes in the exec_host
> >>list. The nodes respond with an ALL_OKAY message and send the
> >>event_com as JOIN_JOB. After the MS receives ALL_OKAY from all the
> >>sister nodes, I can see that it runs the TMomFinalizeJob1,
> >>TMomFinalizeJob2, TMomFinalizeChild, and TMomFinalizeJob3 routines.
> >>In the TMomFinalizeChild routine, the MS also starts up a pbs_demux
> >>process in addition to the job task. What I don't understand is
> >>where exactly in this sequence the MS tells the other nodes to
> >>start the job. Could someone explain, please?
> >>    
> >
> >It doesn't.  Once the sisters have the JOIN_JOB request, they are part
> >of the job.  Notice that the jobs listed on the sisters are always in
> >state "starting."
> Thanks Garrick,
> 
> I figured that out late on Friday. I notice that when the MOM gets an
> ALL_OKAY from all the sisters, it starts up pbs_demux in a forked parent
> and the mpirun command in the child. Now it is in the hands of the MPI
> distribution, correct? Is this where Pete's mpiexec comes into the
> picture, using the TM interface instead of relying on the SSH/rsh style
> of job startup by MPI?

Well, the MOM child doesn't run mpirun directly; it just runs the user's
script.  MOM doesn't know or care what the script does.
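
To make the fork-then-run-the-script shape concrete, here is a minimal
sketch in Python. It is not TORQUE source: run_user_script is a
hypothetical helper, /bin/sh is an assumed shell, and the parent reading
a pipe only loosely stands in for the output collection that pbs_demux
does. The point is that the child execs whatever script it is given and
the parent never looks inside it.

```python
import os


def run_user_script(script_path):
    """Fork; the child execs the user's script, the parent collects output.

    Hypothetical illustration only, not how pbs_mom is implemented.
    Returns (exit_code, captured_stdout).
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: point stdout at the pipe, then become the user's script.
        os.close(read_fd)
        os.dup2(write_fd, 1)
        os.close(write_fd)
        try:
            # Assumption: run the script with /bin/sh. Whether it calls
            # mpirun, mpiexec, or nothing at all is opaque to the caller.
            os.execv("/bin/sh", ["/bin/sh", script_path])
        finally:
            os._exit(127)  # only reached if exec itself failed

    # Parent: read until the child closes its end of the pipe.
    os.close(write_fd)
    chunks = []
    while True:
        data = os.read(read_fd, 4096)
        if not data:
            break
        chunks.append(data)
    os.close(read_fd)

    _, status = os.waitpid(pid, 0)
    exit_code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return exit_code, b"".join(chunks).decode()
```

Calling run_user_script("/path/to/job.sh") returns the script's exit code
and whatever it printed. Starting processes on the other nodes is up to
whatever the script itself launches.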

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California