[torqueusers] sporadic scp failures

Jeff Anderson-Lee jonah at eecs.berkeley.edu
Wed Feb 17 12:14:01 MST 2010


I haven't found anything significant in the logs on the head node -- the 
connections seem to be refused silently.  I did find, though, that sshd was 
using the default MaxStartups value of 10 unauthenticated connections, so 
I bumped that to 100.  I'll try another test batch soon.
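
For anyone wanting to check the same setting, here is a minimal sketch of 
how one might read the value out of the config file.  The function name 
max_startups and the path /etc/ssh/sshd_config are assumptions for 
illustration (the config may live elsewhere on your system), and the 
fallback of 10 is simply the default mentioned above.

```shell
# Hypothetical helper: print the MaxStartups value from an sshd_config-style
# file.  Assumes the usual "Keyword value" layout; commented-out lines are
# skipped because their first field starts with '#'.
max_startups() {
    conf="${1:-/etc/ssh/sshd_config}"   # assumed default location
    awk 'tolower($1) == "maxstartups" { print $2; found = 1; exit }
         END { if (!found) print "10" }' "$conf"
}
```

Running "max_startups" with no argument reads the assumed default path; 
pass another file name to check a different config.  sshd has to be 
reloaded or restarted before a changed value takes effect.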

Joshua Bernstein wrote:
> Jeff,
>
>     Have you looked through the pbs_mom log files, or even 
> /var/log/messages on the headnode? You might be running into a 
> situation where either pbs_mom or sshd (on the headnode) is running 
> out of open file descriptors. If you're using the bash shell, you can 
> look at the maximum number of open files per process using:
>
> $ ulimit -n
> 1024
>
> Generally this number is set to 1024 by default, but if you have a 
> large cluster, and the headnode is rather busy, SSHD may not be able 
> to fork() in order to receive the incoming SCP connection.
>
> -Joshua Bernstein
> Senior Software Engineer
> Penguin Computing
>
> Jeff Anderson-Lee wrote:
>> I'm getting sporadic failures when it tries to copy the results .ER 
>> and .OU files back.  It is not 100% of the time, nor is it 100% 
>> consistent on which hosts have problems.  Sometimes the same host 
>> will succeed for one or both files and sometimes it will fail for both.
>>
>> I'm wondering if this might have something to do with too many scp 
>> requests showing up simultaneously and some sort of rate-limiting 
>> happening.  Any suggestions on where I might look?  What I might 
>> tweak?  Is there some way to increase the default socket backlog, or 
>> that used by inetd/sshd?
>>
>> Thanks.
>>
>> Jeff Anderson-Lee
>>
>>> PBS Job Id: 958.XXX.berkeley.edu
>>> Job Name:   STDIN
>>> Exec host:  s103/11
>>> An error has occurred processing your job, see below.
>>> Post job file processing error; job 958.XXX.berkeley.edu on host 
>>> s103/11
>>>
>>> Unable to copy file /var/spool/torque/spool/958.XXX.berkeley.edu.OU 
>>> to jonah at XXX.berkeley.edu:/home/cs/jonah/STDIN.o958
>>> *** error from copy
>>> ssh_exchange_identification: Connection closed by remote host
>>> lost connection
>>> *** end error output
>>> Output retained on that host in: 
>>> /var/spool/torque/undelivered/958.XXX.berkeley.edu.OU
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers


