[torqueusers] sporadic scp failures

Jeff Anderson-Lee jonah at eecs.berkeley.edu
Wed Feb 17 12:44:55 MST 2010


Adding the following line to /etc/ssh/sshd_config seems to have resolved 
the problem:
    MaxStartups 256

In our case, sshd is running as a standalone init.d daemon and not via 
inetd, so nothing else appears to be involved in spawning the incoming 
ssh/scp sessions.
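
In case it saves someone else the digging: MaxStartups caps the number of 
concurrent unauthenticated connections sshd will hold, and it also accepts 
a three-field "start:rate:full" form that enables random early drop instead 
of a hard cap.  A rough sketch of the change (the init script name will 
vary by distribution, and a reload is enough -- existing sessions are not 
touched):

    # /etc/ssh/sshd_config
    MaxStartups 256               # hard cap on unauthenticated connections
    # MaxStartups 128:30:256     # alternative: start dropping 30% at 128, refuse all at 256

    $ /etc/init.d/sshd reload     # or: service sshd reload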

Jeff

Jeff Anderson-Lee wrote:
> I haven't found anything significant in the logs on the head node -- the 
> connections are seemingly silently refused.  I did find, though, that 
> sshd was using the default limit of 10 unauthenticated connections 
> (MaxStartups), and bumped that to 100.  I'll try another test batch soon.
>
> Joshua Bernstein wrote:
>   
>> Jeff,
>>
>>     Have you looked through the pbs_mom log files, or even 
>> /var/log/messages on the headnode? You might be running into a 
>> situation where either pbs_mom or sshd (on the headnode) is running 
>> out of open file descriptors. If you're using the bash shell, you can 
>> have a look at the maximum number of open files per process using:
>>
>> $ ulimit -n
>> 1024
>>
>> This is usually 1024 by default, but if you have a large cluster and 
>> the headnode is rather busy, sshd may not be able to fork() in order 
>> to receive the incoming scp connection.
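>>
>> (To check what the running daemon itself is allowed -- as opposed to 
>> your login shell, since sshd inherits its limits from init rather than 
>> from bash -- something along these lines should work on a Linux 
>> headnode recent enough to have /proc/<pid>/limits:
>>
>> $ MASTER=$(pgrep -o -x sshd)            # oldest sshd = the listening master
>> $ grep 'open files' /proc/$MASTER/limits
>> $ ls /proc/$MASTER/fd | wc -l           # descriptors currently in use
>>
>> and the same idea applies to the pbs_mom pid on the compute nodes.)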
>>
>> -Joshua Bernstein
>> Senior Software Engineer
>> Penguin Computing
>>
>> Jeff Anderson-Lee wrote:
>>     
>>> I'm getting sporadic failures when Torque tries to copy the resulting 
>>> .ER and .OU files back.  It does not happen 100% of the time, nor is 
>>> it 100% consistent as to which hosts have problems.  Sometimes the same 
>>> host will succeed for one or both files and sometimes it will fail for both.
>>>
>>> I'm wondering if this might have something to do with too many scp 
>>> requests showing up simultaneously and some sort of rate-limiting 
>>> happening.  Any suggestions on where I might look?  What I might 
>>> tweak?  Is there some way to increase the default socket backlog, or 
>>> that used by inetd/sshd?
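>>>
>>> (For concreteness, the two knobs I can find are the kernel's listen() 
>>> backlog and sshd's own cap on unauthenticated connections -- typically 
>>> 128 and 10 by default -- though I don't yet know which one, if either, 
>>> is being hit here:
>>>
>>> $ sysctl net.core.somaxconn                   # kernel listen() backlog
>>> $ grep -i maxstartups /etc/ssh/sshd_config    # sshd pre-auth connection cap
>>>
>>> inetd/xinetd would have its own per-service rate limit as well, but 
>>> only if sshd were actually run from it.)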
>>>
>>> Thanks.
>>>
>>> Jeff Anderson-Lee
>>>
>>>       
>>>> PBS Job Id: 958.XXX.berkeley.edu
>>>> Job Name:   STDIN
>>>> Exec host:  s103/11
>>>> An error has occurred processing your job, see below.
>>>> Post job file processing error; job 958.XXX.berkeley.edu on host 
>>>> s103/11
>>>>
>>>> Unable to copy file /var/spool/torque/spool/958.XXX.berkeley.edu.OU 
>>>> to jonah at XXX.berkeley.edu:/home/cs/jonah/STDIN.o958
>>>> *** error from copy
>>>> ssh_exchange_identification: Connection closed by remote host
>>>> lost connection
>>>> *** end error output
>>>> Output retained on that host in: 
>>>> /var/spool/torque/undelivered/958.XXX.berkeley.edu.OU
>>>>         
>