[torqueusers] Re: LAM-MPI won't boot with torque-1.2.0p6

garrick garrick at usc.edu
Thu Sep 15 15:18:56 MDT 2005


On Thu, Sep 15, 2005 at 11:07:54PM +0200, Ole Holm Nielsen alleged:
> Speaking of a pbs_demux process, when would that be started ?
> It's not running on the nodes after I start an interactive PBS job.

It is supposed to be started at the launch of all multi-node jobs.

 
> >Actually, if you configured torque with --enable-syslog you should have
> >errors related to open_demux() in your syslog.
> 
> Right you are, I see some errors !  On the PBS job master node:
> 
> Sep 15 20:58:27 n469 pbs_mom: Connection refused (111) in open_demux, open_demux: connect 127.0.0.1:34976
> Sep 15 21:29:43 n469 pbs_mom: Connection refused (111) in open_demux, open_demux: connect 127.0.0.1:34987
> Sep 15 21:34:17 n469 pbs_mom: Connection refused (111) in open_demux, open_demux: connect 127.0.0.1:34999
> 
> On a slave node:
> 
> Sep 15 20:58:33 n478 pbs_mom: Connection refused (111) in open_demux, open_demux: connect 10.1.130.219:34976
> Sep 15 21:29:49 n478 pbs_mom: Connection refused (111) in open_demux, open_demux: connect 10.1.130.219:34987
> Sep 15 21:34:23 n478 pbs_mom: Connection refused (111) in open_demux, open_demux: connect 10.1.130.219:34999
> 
> Here 10.1.130.219 is the IP-address of the job master node, n469.
> 
> It seems to me we're getting closer, but what config parameter
> would control access to open_demux() ?
> 
> FYI, I've installed these Torque RPMs (based somewhat on your
> torque.spec file) on the nodes:
> 
> # rpm -qa | grep torque
> torque-1.2.0p6-1.fys
> torque-mom-1.2.0p6-1.fys
> torque-client-1.2.0p6-1.fys
> 
> The nodes have the following files installed in /usr/sbin:
> 
> # ls -la /usr/sbin/pbs*
> -rwxr-xr-x  1 root root  15950 Sep 13 13:46 /usr/sbin/pbs_demux
> -rwsr-xr-x  1 root root  85882 Sep 13 13:46 /usr/sbin/pbs_iff
> -rwx------  1 root root 697041 Sep 13 13:46 /usr/sbin/pbs_mom
> -rwsr-xr-x  1 root root  36913 Sep 13 13:46 /usr/sbin/pbs_rcp
> 
> So pbs_demux is actually installed.  It's part of the torque-client
> RPM, but shouldn't it be part of the torque-mom RPM instead?

Guess that solves that.  You don't have pbs_demux on the nodes because
my spec file is wrong!  I've never noticed because I've always had
torque-client installed on the nodes.
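
A quick way to see whether a node is affected is to check for an executable
pbs_demux. The following is only a minimal sketch: the function name
find_executable is hypothetical, and searching /usr/sbin is an assumption
taken from the ls listing above, not from where pbs_mom actually looks for
the binary.

```python
import os


def find_executable(name, dirs):
    """Return the first path in dirs holding an executable file called name, else None."""
    for d in dirs:
        path = os.path.join(d, name)
        # Must be a regular file and executable by the current user.
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None


if __name__ == "__main__":
    # /usr/sbin is where the listing above shows the binary; adjust as needed.
    found = find_executable("pbs_demux", ["/usr/sbin"])
    print(found if found else "pbs_demux not found or not executable")
```

Running this on each compute node prints either the path of the binary or a
note that it is missing.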

Unfortunately the error message that should have gone to syslog when
pbs_demux wasn't exec'd was broken.  Funny thing, I just fixed this in
CVS right after 1.2.0p6 was released.
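
For context on the "Connection refused (111)" lines: on Linux, errno 111 is
ECONNREFUSED, which a TCP connect returns when nothing is listening on the
target port. That fits a pbs_demux that never started. The sketch below
reproduces the condition locally; the names probe and demo_refused are
hypothetical, and it says nothing about how pbs_mom itself opens the
connection.

```python
import errno
import socket


def probe(host, port, timeout=2.0):
    """Return 0 if a TCP connect succeeds, otherwise the errno of the failure."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port))


def demo_refused():
    """Connect to a local port that is bound but has no listener."""
    holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    holder.bind(("127.0.0.1", 0))  # the kernel picks a free port
    port = holder.getsockname()[1]
    try:
        # listen() was never called, so the kernel rejects the connection.
        return probe("127.0.0.1", port)
    finally:
        holder.close()


if __name__ == "__main__":
    code = demo_refused()
    print(code, errno.errorcode.get(code))
```

The same probe can be pointed at a host and port taken from the syslog lines
while a job is running, to check whether anything is accepting connections
there.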

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California