[torquedev] pbs_demux

Prakash Velayutham velayups at email.uc.edu
Mon Mar 6 12:11:11 MST 2006


Garrick Staples wrote:
> On Fri, Mar 03, 2006 at 10:29:53AM -0500, Prakash Velayutham alleged:
>   
>> Hi All,
>>
>> This question is regarding a multi-node MPI kind of job. After the job 
>> is scheduled and sent over to the MS, the MS first sends a JOIN_JOB 
>> request to all the other nodes in the exec_host list. The nodes respond 
>> with an ALL_OKAY message, with the event_com set to JOIN_JOB. After MS 
>> receives ALL_OKAY from all the sister nodes, I can see that MS goes 
>> through the TMomFinalizeJob1, TMomFinalizeJob2, TMomFinalizeChild, and 
>> TMomFinalizeJob3 routines. In the TMomFinalizeChild routine, MS also 
>> starts up a pbs_demux process in addition to the job task. But what I 
>> don't understand is where exactly in this sequence MS tells the other 
>> nodes to start the job. Could someone explain, please?
>>     
>
> It doesn't.  Once the sisters have the JOIN_JOB request, they are part
> of the job.  Notice that the job list on the sisters is always in state
> "starting."
Thanks Garrick,

I figured that out late on Friday. I notice that when the MOM gets an 
ALL_OKAY from all the sisters, it starts up pbs_demux in a forked parent 
and the mpirun command in the child. Now it is in the hands of the MPI 
distribution, correct? Is this where Pete's mpiexec comes into the 
picture, using the TM interface instead of relying on SSH/rsh-style job 
startup by the MPI distribution?
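
For reference, a TM-based launcher looks roughly like the sketch below.
It is written against the TM calls declared in TORQUE's tm.h (tm_init,
tm_nodeinfo, tm_spawn, tm_poll, tm_finalize) and is essentially a
stripped-down pbsdsh; the compile line and the minimal error handling
are assumptions, not a definitive implementation.

/*
 * Sketch of a TM-based launcher: spawns one copy of a program on every
 * node in the job through each node's MOM -- no rsh/ssh involved.
 * Assumed compile line (library name/path varies by install):
 *     gcc tmlaunch.c -ltorque -o tmlaunch
 */
#include <stdio.h>
#include <tm.h>

int main(int argc, char **argv)
{
    struct tm_roots roots;
    tm_node_id     *nodes;
    int             nnodes, i, tmerr;

    if (argc < 2) {
        fprintf(stderr, "usage: %s program [args...]\n", argv[0]);
        return 1;
    }

    /* attach to the local MOM; fails if not inside a TORQUE job */
    if (tm_init(NULL, &roots) != TM_SUCCESS) {
        fprintf(stderr, "tm_init failed (not running inside a job?)\n");
        return 1;
    }

    /* the nodes MS obtained from exec_host at JOIN_JOB time */
    tm_nodeinfo(&nodes, &nnodes);

    for (i = 0; i < nnodes; i++) {
        tm_task_id tid;
        tm_event_t ev;

        /* ask the MOM on this node to start the task for us
         * (NULL envp, as pbsdsh passes) */
        if (tm_spawn(argc - 1, argv + 1, NULL, nodes[i], &tid, &ev)
                != TM_SUCCESS) {
            fprintf(stderr, "tm_spawn failed on node %d\n", (int)nodes[i]);
            continue;
        }

        /* block until the spawn event is acknowledged */
        tm_poll(TM_NULL_EVENT, &ev, 1, &tmerr);
    }

    tm_finalize();
    return 0;
}

Because the spawns go through each sister's MOM, the tasks end up as
children of pbs_mom and stay under TORQUE's accounting and signal
control, which is the main advantage over rsh/ssh startup.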

Prakash


