[torqueusers] PBS job failure when trying to run an MPI program on multiple nodes

Daniel Widyono widyono at seas.upenn.edu
Tue Jun 27 14:29:00 MDT 2006


> implicitly) try to use more than 1 nodes that I get these errors.  It is 
> as though the files are copied back in a different fashion depending on 
> whether you use 1 or more than 1 node.  Is one node considered a 
> "control node?"

Yes.  This is the design of PBS.

> Might it be that copying information from non-control 
> nodes to the control node blows away the file(s)?

The host pattern in $usecp has to cover every hostname the nodes will see.
The catch-all form is $usecp *:/home /home, but my clusters use an internal
network with internal hostnames, so ours looks like this:
$usecp *.clustername.internal:/home /home.  It appears your $usecp is not
matching the hostname in use.  It might need to look like

	$usecp *.local:/home /home
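
To make the matching concrete, here is a rough Python sketch of how a $usecp
rule could be applied to a destination of the form host:path.  It only
illustrates the idea and is not TORQUE's actual code: the function name
usecp_local_path, the example destination c2.local:/home/user/job.o123, and
the use of shell-style wildcard matching are all assumptions made for this
example.

```python
from fnmatch import fnmatchcase


def usecp_local_path(rules, destination):
    """Map "host:path" to a local path using $usecp-style rules.

    Each rule is "hostpattern:/remote/prefix /local/prefix".
    Returns None when no rule matches (hypothetical helper, not a TORQUE API;
    in that case a remote copy would be needed instead of a plain cp).
    """
    host, _, path = destination.partition(":")
    for rule in rules:
        pattern, local_prefix = rule.split()
        host_pattern, _, remote_prefix = pattern.partition(":")
        # Simplified check: wildcard match on the host, prefix match on the path.
        if fnmatchcase(host, host_pattern) and path.startswith(remote_prefix):
            return local_prefix + path[len(remote_prefix):]
    return None


# Assumed destination: the node knows the frontend by its internal name.
dest = "c2.local:/home/user/job.o123"

print(usecp_local_path(["*.local:/home /home"], dest))
print(usecp_local_path(["c2.cs.princeton.edu:/home /home"], dest))
```

Under these assumptions the first rule maps the destination to
/home/user/job.o123, while the rule written with the public hostname matches
nothing and returns None.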

> It looks like the mom_priv/config file on the compute nodes contains:
> 
> $pbsserver c2.local
> $usecp c2.cs.princeton.edu:/home /home

> is a problem with the compute nodes using the frontend's public address 
> rather than the internal address, so I suppose I'll have to address that 

That is set in $PBS_HOME/server_name on the nodes.  Also check /etc/hosts.

Finally, if you can't ssh from node to node, you'll need to fix that as well.
Google for "site:liniac.upenn.edu hostbased authentication" for an example of
how to set up hostbased SSH across the internal network.

HTH,
Dan W.

