Robert Oostenveld r.oostenveld at donders.ru.nl
Fri Nov 11 01:41:43 MST 2011

Dear Miguel,

On 10 Nov 2011, at 16:37, Gila Arrondo Miguel Angel wrote:
> We are seeing a lot of "pbs_mom: scp" transfer errors in our /var/log/messages, but the files mentioned in these errors are there and are accessible. 
> This is an example of error:
> ...

I don't know whether it is related, but we have also had a problem with scp, which I tracked down to users having an incorrect (i.e. outdated) .ssh/known_hosts file in their NFS-shared home directory.

We have many Linux computers outside the torque cluster from which jobs can be submitted, and sometimes these are updated or reinstalled, which invalidates the ssh host key that was previously associated with their IP address. Users who had a correct known_hosts, or who had specified StrictHostKeyChecking=no in their .ssh/config file, did not have any problems. Users with an outdated known_hosts did encounter problems, but only when submitting from one of the computers where the host key had changed.

The consequence was that on the torque execute hosts scp would look into the user's known_hosts and, depending on where the job was submitted from, would find a correct (for some users) or an incorrect (for other users) host key for the submit client.
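
For reference, the client-side setting mentioned above would look roughly like this in a user's ~/.ssh/config. This is only a sketch: the "Host *" pattern is my assumption, and since it switches off host key checking for every host, a narrower pattern covering only the submit clients is probably preferable.

```
# ~/.ssh/config (sketch; the "Host *" pattern is an assumption, narrow it where possible)
Host *
    # Do not refuse to connect when the host key differs from the one in known_hosts
    StrictHostKeyChecking no
```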

A possible solution would have been to ensure that every user's known_hosts was correct, or that every user had StrictHostKeyChecking=no. But our specific solution was to specify
$usecp *:/home /home
in /var/spool/torque/mom_priv/config, as scp was not needed anyway because of our shared NFS home directory.
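
Spelled out with comments, the relevant part of the mom config would look something like the sketch below. The comments are my reading of the line, and it assumes /home is mounted at the same path on the submit clients and on the execute hosts. pbs_mom has to reread its config before the change takes effect (a restart should do it).

```
# /var/spool/torque/mom_priv/config (sketch)
# For files under /home on any submit host (*), use a local cp to /home
# instead of scp, since /home is the same NFS mount everywhere.
$usecp *:/home /home
```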

