[torqueusers] Re: LAM-MPI won't boot with torque-1.2.0p6

Troy Baer troy at osc.edu
Thu Sep 15 11:10:56 MDT 2005


On Thu, 2005-09-15 at 19:01 +0200, Ole Holm Nielsen wrote:
> Garrick Staples wrote:
> > > Question:  Is Torque's LAM-MPI "tm" boot schema supposed to be
> > > working correctly with torque-1.2.0p6 ?  I'd love to get it to
> > > work because of the performance improvements promised in the
> > > LAM-MPI documentation.
> >  
> > It absolutely should be working.  Can you try something really simple
> > like 'pbsdsh hostname' in your job?  Optionally, 'pbsdsh -v hostname'.
> > If it is failing, check the mom logs with an increased loglevel.
> 
> The result is very interesting, showing obvious errors:
> 
> $ pbsdsh -v hostname
> pbsdsh: spawned task 0
> pbsdsh: spawned task 1
> pbsdsh: spawned task 2
> pbsdsh: waiting on 3 spawned and 0 obits
> spawn event returned: 0
> error 17000 on spawn
> pbsdsh: waiting on 2 spawned and 0 obits
> spawn event returned: 1
> error 15010 on spawn
> pbsdsh: waiting on 1 spawned and 0 obits
> spawn event returned: 2
> error 15010 on spawn
> 
> I also tried pbsdsh 'echo $PATH', as seen in the logs below,
> with the same bad result.  I suppose these errors mean that
> the problem is not related to LAM-MPI, but to torque itself.

Do you have $clienthost entries in $PBS_HOME/mom_priv/config for all of
your compute nodes?  If not, I suspect that's your problem: pbs_mom
needs a $clienthost entry for every host that's allowed to talk to it,
meaning the server and the other moms.
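
As a rough sketch of what that could look like, the entries can be
generated from a list of hosts.  The helper name gen_clienthosts and the
host names below are made up for illustration, and the input format
(one host per line, host name in the first field, as in a typical
server_priv/nodes file) is an assumption; check the output against your
own site before putting it into mom_priv/config.

```shell
#!/bin/sh
# Sketch only: print one $clienthost line per host name read from stdin.
# Assumed input: one host per line, host name in the first field;
# blank lines and lines starting with '#' are skipped.
gen_clienthosts() {
    awk 'NF > 0 && $1 !~ /^#/ { print "$clienthost " $1 }'
}

# Example run with hypothetical host names (server first, then the moms).
printf '%s\n' 'server.example.com' 'node01 np=2' 'node02 np=2' | gen_clienthosts
```

The resulting lines would go into $PBS_HOME/mom_priv/config on each
compute node, and pbs_mom would most likely need to be restarted (or told
to reread its config) before they take effect.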

	--Troy
-- 
Troy Baer                       troy at osc.edu
Science & Technology Support    http://www.osc.edu/hpc/
Ohio Supercomputer Center       614-292-9701


