[torqueusers] Re: LAM-MPI won't boot with torque-1.2.0p6
garrick
garrick at usc.edu
Thu Sep 15 15:18:56 MDT 2005
On Thu, Sep 15, 2005 at 11:07:54PM +0200, Ole Holm Nielsen alleged:
> Speaking of a pbs_demux process, when would that be started ?
> It's not running on the nodes after I start an interactive PBS job.
It is supposed to be started at the launch of all multi-node jobs.
> >Actually, if you configured torque with --enable-syslog you should have
> >errors related to open_demux() in your syslog.
>
> Right you are, I see some errors ! On the PBS job master node:
>
> Sep 15 20:58:27 n469 pbs_mom: Connection refused (111) in open_demux,
> open_demux: connect 127.0.0.1:34976
> Sep 15 21:29:43 n469 pbs_mom: Connection refused (111) in open_demux,
> open_demux: connect 127.0.0.1:34987
> Sep 15 21:34:17 n469 pbs_mom: Connection refused (111) in open_demux,
> open_demux: connect 127.0.0.1:34999
>
> On a slave node:
>
> Sep 15 20:58:33 n478 pbs_mom: Connection refused (111) in open_demux,
> open_demux: connect 10.1.130.219:34976
> Sep 15 21:29:49 n478 pbs_mom: Connection refused (111) in open_demux,
> open_demux: connect 10.1.130.219:34987
> Sep 15 21:34:23 n478 pbs_mom: Connection refused (111) in open_demux,
> open_demux: connect 10.1.130.219:34999
>
> Here 10.1.130.219 is the IP-address of the job master node, n469.
>
> It seems to me we're getting closer, but what config parameter
> would control access to open_demux() ?
>
> FYI, I've installed these Torque RPMs (based somewhat on your
> torque.spec file) on the nodes:
>
> # rpm -qa | grep torque
> torque-1.2.0p6-1.fys
> torque-mom-1.2.0p6-1.fys
> torque-client-1.2.0p6-1.fys
>
> The nodes have the following files installed in /usr/sbin:
>
> # ls -la /usr/sbin/pbs*
> -rwxr-xr-x 1 root root 15950 Sep 13 13:46 /usr/sbin/pbs_demux
> -rwsr-xr-x 1 root root 85882 Sep 13 13:46 /usr/sbin/pbs_iff
> -rwx------ 1 root root 697041 Sep 13 13:46 /usr/sbin/pbs_mom
> -rwsr-xr-x 1 root root 36913 Sep 13 13:46 /usr/sbin/pbs_rcp
>
> So pbs_demux is actually installed. It's part of the torque-client
> RPM, but shouldn't it be part of the torque-mom RPM instead?
Guess that solves that. You don't have pbs_demux on the nodes because
my spec file is wrong! I've never noticed because I've always had
torque-client installed on the nodes.
Unfortunately the error message that should have gone to syslog when
pbs_demux wasn't exec'd was broken. Funny thing, I just fixed this in
CVS right after 1.2.0p6 was released.
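For anyone hitting the same symptom, here is a minimal sketch of how the
open_demux errors quoted above could be summarized from syslog. The function
name summarize_demux_errors is made up for illustration, and the log location
in the usage line (/var/log/messages) is an assumption that depends on your
syslog configuration.

```shell
#!/bin/sh
# Hypothetical helper (not part of torque): reads syslog-style lines on
# stdin and prints each distinct open_demux connect target ("host:port")
# followed by the number of times it appears.
summarize_demux_errors() {
    grep 'open_demux: connect' |
        sed 's/.*open_demux: connect \([0-9.]*:[0-9]*\).*/\1/' |
        sort | uniq -c | awk '{print $2, $1}'
}
```

Usage, assuming the mom logs to /var/log/messages:
summarize_demux_errors < /var/log/messages
To see which installed package owns the binary, rpm -qf /usr/sbin/pbs_demux
reports the owning package name.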
--
Garrick Staples, Linux/HPCC Administrator
University of Southern California