[torqueusers] Random SCP errors when transferring to/from CREAM sandbox

Gila Arrondo Miguel Angel miguel.gila at cscs.ch
Thu Nov 10 08:37:48 MST 2011

Hi all, 

We are seeing a lot of "pbs_mom: scp" transfer errors in our /var/log/messages, but the files mentioned in these errors are there and are accessible. 

Here is an example of the error:

wn113: Nov  9 14:54:57 wn113 pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB out_cre02_208919067_StandardOutput cms079@cream02.lcg.cscs.ch:/cream_localsandbox/data/cms/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_eaguiloc_CN_555092_CN_Ernest_Aguilo_Chivite_cms_Role_NULL_Capability_NULL_cms079/20/CREAM208919067/StandardOutput' failed with status=1, giving up after 4 attempts

These kinds of errors happen every day, with no apparent correlation to cron jobs or to any other cfengine tasks that run on a regular schedule. Here is a summary of them across all the WNs in our cluster:

Total: 12 in cream02 on Nov 10
Total: 2 in cream01 on Nov 10

Total: 72 in cream02 on Nov 09 
Total: 74 in cream01 on Nov 09 

Total: 52 in cream02 on Nov 08
Total: 2 in cream01 on Nov 08

Total: 212 in cream02 on Nov 07
Total: 36 in cream01 on Nov 07

Total: 1240 in cream02 on Nov 06
Total: 465 in cream01 on Nov 06
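
A per-day, per-CE tally like the one above can be produced with a short script. The following is a minimal Python sketch, assuming the syslog format of the example error line; the function name count_scp_errors and the regular expression are illustrative and are not part of TORQUE or CREAM.

```python
import re
import sys
from collections import Counter

# Matches a syslog line such as the example above and captures the month,
# the day, and the short name of the destination host (e.g. "cream02").
# The host is taken from the first "user@host" in the scp command; the
# " at " alternative covers archives that obfuscate "@".
LINE_RE = re.compile(
    r"(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+\S+\s+"
    r"pbs_mom:\s+LOG_ERROR::sys_copy,.*?(?:@|\sat\s)(?P<ce>[\w-]+)"
)


def count_scp_errors(lines):
    """Return a Counter keyed by (month, day, ce) for sys_copy errors."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            key = (match.group("month"), int(match.group("day")), match.group("ce"))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Example use: concatenate /var/log/messages from the WNs and pipe it in.
    for (month, day, ce), total in sorted(count_scp_errors(sys.stdin).items()):
        print("Total: {} in {} on {} {:02d}".format(total, ce, month, day))
```

Lines that are not sys_copy errors are skipped, so the whole of /var/log/messages can be fed in unfiltered.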

At the moment there are two CREAM-CEs (the endpoint hosts of these scp transfers): one is a VM (cream01) and the other is a physical machine (cream02). Each has its own local cream_sandbox directory (the endpoint location of the scp transfers) and enough computing power to handle the ssh connections and all the other CREAM services. Initially we had the cream_sandbox shared through a Lustre filesystem, but since it was unreliable and became very slow at times (jobs ran there), we decided to move it to local disk. These issues did not happen before: since the sandbox was shared, we used regular $usecp.

We are aware that you can tune this with the $rcpcmd directive in the pbs_mom config file, but since we are not sure what the error is, we don't know what to change in the settings. MaxStartups in /etc/ssh/sshd_config is set to 20000:

> MaxStartups 20000
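
Since the pbs_mom log only reports "status=1", one way to find out why scp fails would be to point $rcpcmd at a small wrapper that records scp's stderr. The following is a minimal Python sketch of that idea, not a tested deployment: the log path is hypothetical, and it is an assumption that pbs_mom will run a wrapper script given as $rcpcmd in the same way it runs scp directly.

```python
#!/usr/bin/env python3
import subprocess
import sys
import time

SCP = "/usr/bin/scp"                # path taken from the logged command
LOG_PATH = "/tmp/scp_wrapper.log"   # hypothetical; must be writable by the job user


def run_and_log(cmd, log_path):
    """Run cmd; on failure append its status and stderr to log_path.

    Returns the exit status of cmd so the caller sees the same result
    it would have seen without the wrapper.
    """
    proc = subprocess.run(cmd, stderr=subprocess.PIPE, universal_newlines=True)
    if proc.returncode != 0:
        try:
            with open(log_path, "a") as log:
                log.write("{} status={} cmd={!r}\n{}\n".format(
                    time.strftime("%Y-%m-%d %H:%M:%S"),
                    proc.returncode, cmd, proc.stderr.strip()))
        except OSError:
            pass  # never let a logging problem change the exit status
    return proc.returncode


if __name__ == "__main__":
    # Pass all arguments straight through to scp.
    sys.exit(run_and_log([SCP] + sys.argv[1:], LOG_PATH))
```

The wrapper exits with scp's own status, so the retry behaviour seen in the log ("giving up after 4 attempts") should be unaffected while the captured stderr shows the actual reason for each failure.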

We've checked /var/log/secure for scp errors, but everything seems to be OK there.

Any ideas on what could be wrong?

Thanks in advance,

Miguel Gila
CSCS Swiss National Supercomputing Centre 
HPC Solutions
Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland
miguel.gila at cscs.ch | www.cscs.ch | Phone +41 91 610 82 22
