[torquedev] pbs_demux
Garrick Staples
garrick at usc.edu
Mon Mar 6 12:19:58 MST 2006
On Mon, Mar 06, 2006 at 02:11:11PM -0500, Prakash Velayutham alleged:
> Garrick Staples wrote:
> >On Fri, Mar 03, 2006 at 10:29:53AM -0500, Prakash Velayutham alleged:
> >
> >>Hi All,
> >>
> >>This question is regarding a multi-node MPI kind of job. After the job
> >>is scheduled and sent over to the MS, MS first does a JOIN_JOB request
> >>to all the other nodes in the exec_host list. The nodes respond with an
> >>ALL_OKAY message and send the event_com as JOIN_JOB. After MS receives
> >>ALL_OKAY from all the sister nodes, I can see that MS goes through the
> >>processes of TMomFinalizeJob1, TMomFinalizeJob2, TMomFinalizeChild,
> >>TMomFinalizeJob3 routines. In the TMomFinalizeChild routine MS starts up
> >>a pbs_demux process in addition to the job task. But what I don't
> >>understand is where exactly in this sequence the MS tells the
> >>other nodes to start the job. Could someone explain, please?
> >>
> >
> >It doesn't. Once the sisters have the JOIN_JOB request, they are part
> >of the job. Notice that the jobs listed on sisters are always in state
> >"starting."
> Thanks Garrick,
>
> I figured that out late on Friday. I notice that when the MOM gets an
> ALL_OKAY from all the sisters, it starts up pbs_demux in a forked parent
> and the mpirun command in the child. Now it is in the hands of the MPI
> distribution, correct? Is this where Pete's mpiexec comes into the
> picture, using the TM interface instead of relying on the SSH / rsh kind
> of job startup by MPI?
Well, the MOM child doesn't run mpirun directly; it just runs the user's
script. MOM doesn't know or care what the script does.
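
To make the fork-and-run pattern concrete, here is a minimal sketch in
Python. It is not TORQUE code. The names run_user_script and write_script
are hypothetical, and having the parent collect the child's stdout over a
pipe is an assumption made for illustration. The point it shows is that
the child simply execs whatever script it is handed, without looking
inside it.

```python
import os
import tempfile


def write_script(body):
    """Write a throwaway shell script and return its path (illustration only)."""
    fd, path = tempfile.mkstemp(suffix=".sh")
    with os.fdopen(fd, "w") as f:
        f.write(body)
    return path


def run_user_script(script_path):
    """Fork; the child execs the script, the parent collects its stdout.

    Hypothetical helper, not part of TORQUE. Returns (exit_code, output).
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: point stdout at the pipe, then become the user's script.
        # The script's contents are never inspected.
        try:
            os.close(read_fd)
            os.dup2(write_fd, 1)
            os.close(write_fd)
            os.execv("/bin/sh", ["/bin/sh", script_path])
        finally:
            os._exit(127)  # reached only if the exec fails

    # Parent: stays behind and reads whatever the child writes.
    os.close(write_fd)
    chunks = []
    while True:
        data = os.read(read_fd, 4096)
        if not data:
            break
        chunks.append(data)
    os.close(read_fd)

    _, status = os.waitpid(pid, 0)
    code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return code, b"".join(chunks).decode()


if __name__ == "__main__":
    # The script could call mpirun, mpiexec, or nothing MPI-related at all.
    path = write_script("echo running user job\n")
    print(run_user_script(path))
    os.remove(path)
```

Whether the script goes on to launch MPI ranks over rsh/ssh or through
the TM interface is entirely up to the script; the launcher above would
look the same either way.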
--
Garrick Staples, Linux/HPCC Administrator
University of Southern California