[torqueusers] Re: mount.nfs starts failing after pbs_server gets "warmed up", starts using up more than 700 sockets

Sabuj Pattanayek sabujp at gmail.com
Sat Apr 20 11:18:59 MDT 2013


Thanks, I've been pulling my hair out over this for quite some time now!
I recompiled with --disable-privports, and it looks like that did the trick.
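For anyone hitting the same issue, the rebuild might look roughly like this (a sketch only; the source directory and install steps are illustrative, adjust for your TORQUE version and layout):

```shell
# Rebuild TORQUE with reserved-port checking disabled (illustrative paths)
cd torque-src                    # wherever your TORQUE source tree lives
./configure --disable-privports
make
make install                     # as root; then restart pbs_server and the pbs_moms
```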


On Sat, Apr 20, 2013 at 11:36 AM, Chris Hunter <chris.hunter at yale.edu> wrote:

>
> This is a known problem. The solution is courtesy of Brock Palen, who hosts
> the RCE podcast.
>
> http://www.supercluster.org/pipermail/torqueusers/2011-March/012425.html
>
>> # release sockets faster because we use a lot of them
>> net.ipv4.tcp_fin_timeout = 20
>> # Reuse sockets as fast as possible
>> net.ipv4.tcp_tw_reuse = 1
>> net.ipv4.tcp_tw_recycle = 1
>>
>> You can also build torque to not use priv ports.
>> Lastly, you can increase job_stat_rate.
>>
>
> chris hunter
> yale hpc group
>
>> Message: 4
>> Date: Sat, 20 Apr 2013 11:29:14 -0500
>> From: Sabuj Pattanayek <sabujp at gmail.com>
>> Subject: [torqueusers] mount.nfs starts failing after pbs_server gets
>>         "warmed up", starts using up more than 700 sockets
>> To: "torqueusers at supercluster.org" <torqueusers at supercluster.org>
>> Message-ID:
>>         <CAEeMGHuw_pwjqKAYVZ91fLAdav6WrdJ93=rPEVu_gWyrXefANA at mail.gmail.com>
>> Content-Type: text/plain; charset="iso-8859-1"
>>
>> Hi,
>>
>> Has anyone seen a problem where mount.nfs will start failing with:
>>
>> mount.nfs: mount(2): Input/output error
>> mount.nfs: mount system call failed
>> rc = 32 (return code)
>>
>> when pbs_server starts making lots of connections? I'm fairly certain I've
>> tracked the problem down to pbs_server and not any other process because
>> mount.nfs will reliably start working again after pbs_server is killed. We
>> only have 36 nodes, the system running pbs_server is a KVM virtualized
>> system running with 6 virtual procs, 9GB of RAM, system load is near 0:
>>
>> # uptime
>>  11:27:27 up  9:10,  9 users,  load average: 0.11, 0.16, 0.30
>>
>> memory usage is negligible (free -m) :
>>
>>              total       used       free     shared    buffers     cached
>> Mem:          8880       1401       7479          0        223        624
>> -/+ buffers/cache:        553       8327
>> Swap:         5119          0       5119
>>
>> I've tried renicing pbs_server to 20, and ionicing it to class 3 (idle) to
>> no avail. Anyone have any other ideas?
>>
>> Thanks,
>> Sabuj
>>
>
>
>
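The tuning suggestions quoted above can be applied roughly as follows (a sketch, not a recommendation: the job_stat_rate value is only an example, and note that tcp_tw_recycle is unsafe behind NAT and was removed entirely in Linux 4.12):

```shell
# Persist the TCP tuning from the quoted post in /etc/sysctl.conf, then load it:
cat >> /etc/sysctl.conf <<'EOF'
# release sockets faster because pbs_server uses a lot of them
net.ipv4.tcp_fin_timeout = 20
# reuse TIME_WAIT sockets as fast as possible
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1   # removed in Linux >= 4.12; avoid behind NAT
EOF
sysctl -p

# Have pbs_server poll the MOMs for job status less often
# (seconds; 300 is only an example value):
qmgr -c "set server job_stat_rate = 300"
```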

