[torqueusers] PBS job failure when trying to run an MPI program
on multiple nodes
widyono at seas.upenn.edu
Tue Jun 27 14:29:00 MDT 2006
> implicitly) try to use more than 1 nodes that I get these errors. It is
> as though the files are copied back in a different fashion depending on
> whether you use 1 or more than 1 node. Is one node considered a
> "control node?"
Yes. This is the design of PBS.
> Might it be that copying information from non-control
> nodes to the control node blows away the file(s)?
The $usecp parameter should match all of the hostnames, e.g. $usecp *:/home /home.
For my clusters we use an internal network with internal hostnames, so mine looks
like this: $usecp *.clustername.internal:/home /home. It appears your $usecp
pattern is not matching the hostnames your nodes actually use; it probably needs
to look like $usecp *.local:/home /home.
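For reference, a minimal mom_priv/config along those lines might look like the
following (the hostnames are placeholders for your own internal names):

```
# mom_priv/config on each compute node
$pbsserver  headnode.local           # the server's *internal* hostname
$usecp      *.local:/home /home      # pattern must match every node's internal name
```

As I understand it, if the $usecp pattern does not match, pbs_mom falls back to
rcp/scp for staging output files, which is where these copies can fail.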
> It looks like the mom_priv/config file on the compute nodes contains:
> $pbsserver c2.local
> $usecp c2.cs.princeton.edu:/home /home
> is a problem with the compute nodes using the frontend's public address
> rather than the internal address, so I suppose I'll have to address that
That would be the $PBS_HOME/server_name on the nodes. Also /etc/hosts.
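Concretely, something like this on each node (addresses here are made up for
illustration; use your real internal addresses):

```
# $PBS_HOME/server_name -- a single line, the server's internal name
c2.local

# /etc/hosts -- internal addresses for the server and every node
10.0.0.1   c2.local
10.0.0.11  node01.local
10.0.0.12  node02.local
```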
Finally, if you can't ssh from node to node, you'll need to fix that as well.
Google for "site:liniac.upenn.edu hostbased authentication" for an example of
how to set up hostbased SSH across the internal network.
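Roughly, hostbased SSH needs something like the following on the nodes -- a
sketch only, so check the sshd_config(5) and ssh_config(5) man pages for your
OpenSSH version, and substitute your own node names:

```
# /etc/ssh/sshd_config (on every node accepting logins)
HostbasedAuthentication yes

# /etc/ssh/ssh_config (on every node initiating logins)
HostbasedAuthentication yes
EnableSSHKeysign yes

# /etc/ssh/shosts.equiv -- internal names of trusted nodes
node01.local
node02.local
```

plus each node's host key collected into /etc/ssh/ssh_known_hosts.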