[torqueusers] Random SCP errors when transferring to/from CREAM sandbox

Gila Arrondo Miguel Angel miguel.gila at cscs.ch
Wed Nov 16 09:24:56 MST 2011


Dear Robert,

Many thanks for your answer. We've made sure that the keys are okay, and we also disabled host key checking as a test.

We've also tuned some TCP values in the hope that the transfers would stop failing:

net.core.rmem_max = 87380000
net.core.wmem_max = 65536000
net.ipv4.tcp_rmem = 8192 873800 87380000
net.ipv4.tcp_wmem = 4096 655360 65536000

But none of this has worked.
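
For anyone repeating this, here is a minimal sketch for confirming that the values listed above are really live on a worker node and were not silently overridden. The function names normalize and check_sysctl are made up for illustration; the expected values are simply the ones quoted in this message.

```shell
#!/bin/sh
# Collapse tabs and repeated spaces, because the kernel prints
# multi-value settings such as tcp_rmem separated by tabs.
normalize() {
    set -- $1          # rely on default word splitting
    printf '%s' "$*"
}

# check_sysctl KEY WANTED: compare the live value with the intended one.
check_sysctl() {
    key=$1
    want=$2
    have=$(sysctl -n "$key" 2>/dev/null) || { echo "MISSING $key"; return 1; }
    if [ "$(normalize "$have")" = "$(normalize "$want")" ]; then
        echo "OK      $key"
    else
        echo "DIFF    $key: live='$have' wanted='$want'"
        return 1
    fi
}

status=0
check_sysctl net.core.rmem_max "87380000"               || status=1
check_sysctl net.core.wmem_max "65536000"               || status=1
check_sysctl net.ipv4.tcp_rmem "8192 873800 87380000"   || status=1
check_sysctl net.ipv4.tcp_wmem "4096 655360 65536000"   || status=1
[ "$status" -eq 0 ] || echo "at least one setting differs from the intended value"
```

This only verifies that the tuning took effect; it says nothing about whether the tuning is relevant to the scp failures.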

The WNs connect to the CREAMs via a vnic over InfiniBand. We know it's not the best scenario for debugging network issues, but we have run sustained GridFTP connections along with a mix of other protocols and have seen no problems so far. This must be something related to torque/ssh...
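
One way to separate torque from ssh would be to repeat the same scp by hand, outside pbs_mom, and count how often it fails. The sketch below assumes the transfer can be reproduced with a plain scp command; count_failures, SRC and DST are hypothetical names, and the file and host to use are left to the reader.

```shell
#!/bin/sh
# count_failures N CMD [ARGS...]: run CMD N times and print how many runs
# returned a non-zero exit status.
count_failures() {
    n=$1
    shift
    fails=0
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" >/dev/null 2>&1 || fails=$((fails + 1))
        i=$((i + 1))
    done
    echo "$fails"
}

# Only runs when SRC and DST are set, e.g. (hypothetical values)
#   SRC=/tmp/testfile DST=user@cream-host:/tmp/testfile sh this_script.sh
# BatchMode=yes makes scp fail instead of prompting for a password.
if [ -n "${SRC:-}" ] && [ -n "${DST:-}" ]; then
    echo "failed transfers out of 100:" \
        "$(count_failures 100 scp -q -o BatchMode=yes "$SRC" "$DST")"
fi
```

If the manual loop never fails while pbs_mom keeps logging errors, that would point towards the way torque invokes scp rather than towards the network.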

Any other ideas??


Cheers,
Miguel




On Nov 11, 2011, at 9:41 AM, Robert Oostenveld wrote:

> Dear Miguel,
> 
> 
> On 10 Nov 2011, at 16:37, Gila Arrondo Miguel Angel wrote:
>> We are seeing a lot of "pbs_mom: scp" transfer errors in our /var/log/messages, but the files mentioned in these errors are there and are accessible. 
>> 
>> This is an example of error:
>> ...
> 
> 
> I don't know whether it may be related, but we have also had a problem with scp which I tracked down to the users having an incorrect (i.e. not up-to-date) .ssh/known_hosts file in their NFS shared home directory.
> 
> We have many non-torque-cluster linux computers from which jobs can be submitted, and sometimes these are updated/reinstalled which invalidates the ssh host key that was previously assigned to their IP address. The users that had a correct known_hosts or that had specified StrictHostKeyChecking=no in their .ssh/config file did not have any problems, but the users that had an outdated known_hosts did encounter problems (but then only when submitting from one of the nodes where the host key had changed).
> 
> The consequence was that on the torque execute hosts scp would look into the user's known_hosts and, depending on where the job was submitted from, would find a correct (for some users) or an incorrect (for other users) host key for the submit client.
> 
> A possible solution would have been to ensure that all users' known_hosts was correct or that all users had StrictHostKeyChecking=no. But our specific solution was to specify
> $usecp *:/home /home
> in the /var/spool/torque/mom_priv/config, as scp was not needed anyway because of our shared NFS home directory.
> 
> best regards,
> Robert
> 
> 
> -----------------------------------------------------------
> Robert Oostenveld, PhD
> Senior Researcher & MEG Physicist
> Donders Institute for Brain, Cognition and Behaviour
> Centre for Cognitive Neuroimaging
> Radboud University Nijmegen
> tel.: +31 (0)24 3619695
> e-mail: r.oostenveld at donders.ru.nl
> web: http://www.ru.nl/neuroimaging
> skype: r.oostenveld
> -----------------------------------------------------------
> 
> 
> 
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

--
Miguel Gila
CSCS Swiss National Supercomputing Centre 
HPC Solutions
Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland
miguel.gila at cscs.ch | www.cscs.ch | Phone +41 91 610 82 22
