[torqueusers] sporadic scp failures

Joshua Bernstein jbernstein at penguincomputing.com
Wed Feb 17 12:47:23 MST 2010


Excellent Jeff,

	Thank you for the update!

-Josh

Jeff Anderson-Lee wrote:
> Adding the following line to /etc/ssh/sshd_config seems to have resolved 
> the problem:
>     MaxStartups 256
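> 
> (A note on the syntax: OpenSSH's MaxStartups also accepts a 
> three-field start:rate:full form for random early drop, so rather 
> than a single hard cap you can start refusing a percentage of new 
> unauthenticated connections at one threshold and refuse all of them 
> at a higher one. The numbers below are only an illustration, not a 
> recommendation:)
> 
>     # sketch for /etc/ssh/sshd_config -- tune for your cluster
>     # begin dropping 30% of new unauthenticated connections once
>     # 64 are pending; refuse all of them once 256 are pending
>     MaxStartups 64:30:256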
> 
> In our case, sshd is running as an init.d daemon and not via inetd, so 
> no other process (such as inetd) is involved in spawning the connections.
> 
> Jeff
> 
> Jeff Anderson-Lee wrote:
>> I haven't found anything significant in the logs on the head node -- the 
>> connections are seemingly silently refused.  I did find that sshd was 
>> using the default limit of 10 unauthenticated connections (MaxStartups), 
>> though, and bumped that to 100.  I'll try another test batch soon.
>>
>> Joshua Bernstein wrote:
>>   
>>> Jeff,
>>>
>>>     Have you looked through the pbs_mom log files, or even 
>>> /var/log/messages on the headnode? You might be running into a 
>>> situation where either pbs_mom or sshd (on the headnode) is running 
>>> out of open file descriptors. If you're using the bash shell, you can 
>>> check the maximum number of open files per process with:
>>>
>>> $ ulimit -n
>>> 1024
>>>
Generally this number is set to 1024 by default, but if you have a 
>>> large cluster and the headnode is rather busy, sshd may not be able 
>>> to fork() in order to receive the incoming scp connection.
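>>>
>>> (For illustration: one common way to raise that limit persistently 
>>> is via /etc/security/limits.conf, assuming pam_limits is in use on 
>>> the headnode. The values here are just examples:)
>>>
>>> # /etc/security/limits.conf -- example values, tune as needed
>>> *    soft    nofile    4096
>>> *    hard    nofile    8192
>>>
>>> $ ulimit -n 4096    # or raise it for the current shell only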
>>>
>>> -Joshua Bernstein
>>> Senior Software Engineer
>>> Penguin Computing
>>>
>>> Jeff Anderson-Lee wrote:
>>>     
>>>> I'm getting sporadic failures when it tries to copy the results .ER 
>>>> and .OU files back.  It is not 100% of the time, nor is it 100% 
>>>> consistent on which hosts have problems.  Sometimes the same host 
>>>> will succeed for one or both files and sometimes it will fail for both.
>>>>
>>>> I'm wondering if this might have something to do with too many scp 
>>>> requests showing up simultaneously and some sort of rate-limiting 
>>>> happening.  Any suggestions on where I might look?  What I might 
>>>> tweak?  Is there some way to increase the default socket backlog, or 
>>>> that used by inetd/sshd?
>>>>
>>>> Thanks.
>>>>
>>>> Jeff Anderson-Lee
>>>>
>>>>       
>>>>> PBS Job Id: 958.XXX.berkeley.edu
>>>>> Job Name:   STDIN
>>>>> Exec host:  s103/11
>>>>> An error has occurred processing your job, see below.
>>>>> Post job file processing error; job 958.XXX.berkeley.edu on host 
>>>>> s103/11
>>>>>
>>>>> Unable to copy file /var/spool/torque/spool/958.XXX.berkeley.edu.OU 
>>>>> to jonah at XXX.berkeley.edu:/home/cs/jonah/STDIN.o958
>>>>> *** error from copy
>>>>> ssh_exchange_identification: Connection closed by remote host
>>>>> lost connection
>>>>> *** end error output
>>>>> Output retained on that host in: 
>>>>> /var/spool/torque/undelivered/958.XXX.berkeley.edu.OU
>>>>>         
>>>> _______________________________________________
>>>> torqueusers mailing list
>>>> torqueusers at supercluster.org
>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>       
> 

