[torqueusers] Re: mount.nfs starts failing after pbs_server gets "warmed up", starts using up more than 700 sockets (Sabuj Pattanayek)
sabujp at gmail.com
Sat Apr 20 11:18:59 MDT 2013
Thanks, I've been pulling my hair out over this for quite some time now!
Recompiled with --disable-privports and looks like that did the trick.
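
For anyone landing on this thread later, here is a rough way to check whether privileged ports are the bottleneck, before and after the rebuild. It is only a sketch, assuming the failures come from pbs_server holding most of the reserved ports (below 1024), which mount.nfs also needs by default. The function name count_priv_ports is made up for illustration, and it expects input in the format printed by `ss -tan`.

```shell
#!/bin/sh
# Count TCP sockets whose LOCAL port is below 1024, grouped by state.
# Reads `ss -tan`-style lines on stdin:
#   State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
count_priv_ports() {
    awk '$1 == "State" { next }          # skip the header line
    {
        n = split($4, a, ":")            # port is the last ":"-separated piece
        port = a[n] + 0
        if (port > 0 && port < 1024) count[$1]++
    }
    END { for (s in count) printf "%s %d\n", s, count[s] }' | sort
}

# Usage on a live system (assumes iproute2's ss is installed):
#   ss -tan | count_priv_ports
#
# Rebuild step mentioned above (exact build steps are an assumption,
# only the --disable-privports flag comes from this thread):
#   ./configure --disable-privports && make && make install
```

If the counts are in the hundreds with pbs_server running and drop sharply once it is stopped or rebuilt, that is consistent with the port exhaustion described in this thread.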
On Sat, Apr 20, 2013 at 11:36 AM, Chris Hunter <chris.hunter at yale.edu> wrote:
> This is a known problem. Solution is courtesy of brock palen who hosts the
> RCE podcasts.
>> # release sockets faster because we use a lot of them
>> net.ipv4.tcp_fin_timeout = 20
>> # Reuse sockets as fast as possible
>> net.ipv4.tcp_tw_reuse = 1
>> net.ipv4.tcp_tw_recycle = 1
>> You can also build torque to not use priv ports.
>> Lastly you can increase job_stat_rate.
> chris hunter
> yale hpc group
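
A hedged sketch of applying the sysctl settings quoted above. The function name emit_socket_sysctl and the drop-in file name in the usage comment are assumptions, not something from this thread. One caveat worth knowing: net.ipv4.tcp_tw_recycle was removed in Linux 4.12 and is known to break connections from clients behind NAT, so the sketch only emits it when the running kernel still exposes it.

```shell
#!/bin/sh
# Print the quoted settings as a sysctl fragment on stdout.
# $1: directory holding the ipv4 sysctl entries (defaults to the live /proc path;
#     overridable so the function can be exercised without touching the system).
emit_socket_sysctl() {
    procdir=${1:-/proc/sys/net/ipv4}
    echo "net.ipv4.tcp_fin_timeout = 20"
    echo "net.ipv4.tcp_tw_reuse = 1"
    # Only emit tcp_tw_recycle where the kernel still has it (gone since 4.12).
    if [ -e "$procdir/tcp_tw_recycle" ]; then
        echo "net.ipv4.tcp_tw_recycle = 1"
    fi
}

# Hypothetical usage, as root (the file name is an assumption):
#   emit_socket_sysctl > /etc/sysctl.d/90-torque-sockets.conf
#   sysctl -p /etc/sysctl.d/90-torque-sockets.conf
```

Writing the fragment to a file rather than calling `sysctl -w` directly keeps the settings across reboots and makes them easy to revert.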
> Message: 4
>> Date: Sat, 20 Apr 2013 11:29:14 -0500
>> From: Sabuj Pattanayek <sabujp at gmail.com>
>> Subject: [torqueusers] mount.nfs starts failing after pbs_server gets
>> "warmed up", starts using up more than 700 sockets
>> To: "torqueusers at supercluster.org" <torqueusers at supercluster.org>
>> Content-Type: text/plain; charset="iso-8859-1"
>> Anyone seen a problem where mount.nfs will start failing with :
>> mount.nfs: mount(2): Input/output error
>> mount.nfs: mount system call failed
>> rc = 32 (return code)
>> when pbs_server starts making lots of connections? I'm fairly certain I've
>> tracked the problem down to pbs_server and not any other process because
>> mount.nfs will reliably start working again after pbs_server is killed. We
>> only have 36 nodes, the system running pbs_server is a KVM virtualized
>> system running with 6 virtual procs, 9GB of RAM, system load is near 0:
>> # uptime
>> 11:27:27 up 9:10, 9 users, load average: 0.11, 0.16, 0.30
>> memory usage is negligible (free -m) :
>>              total       used       free     shared    buffers     cached
>> Mem:          8880       1401       7479          0        223        624
>> -/+ buffers/cache:        553       8327
>> Swap:         5119          0       5119
>> I've tried renicing pbs_server to 20, and ionicing it to class 3 (idle) to
>> no avail. Anyone have any other ideas?