[torqueusers] Re: mount.nfs starts failing after pbs_server gets "warmed up", starts using up more than 700 sockets (Sabuj Pattanayek)

Chris Hunter chris.hunter at yale.edu
Sat Apr 20 10:36:26 MDT 2013


This is a known problem. The solution quoted below is courtesy of Brock
Palen, who hosts the RCE podcast.

http://www.supercluster.org/pipermail/torqueusers/2011-March/012425.html

> # release sockets faster because we use a lot of them
> net.ipv4.tcp_fin_timeout = 20
> # Reuse sockets as fast as possible
> net.ipv4.tcp_tw_reuse = 1
> net.ipv4.tcp_tw_recycle = 1
>
> You can also build TORQUE to not use privileged ports.
> Lastly, you can increase job_stat_rate,
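
To check whether these settings are helping, it can be useful to see how many
privileged ports are tied up and in which TCP state. Below is a minimal
sketch, assuming the mount failures come from pbs_server holding most of the
ports below 1024 (which NFS mounts normally also want); that is my reading of
the "priv ports" remark, not something confirmed in this thread. The function
names count_privileged and report are made up for illustration. It only looks
at IPv4 sockets in /proc/net/tcp.

```python
from collections import Counter

# TCP state codes as they appear (in hex) in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_privileged(lines, limit=1024):
    """Count sockets whose local port is below `limit`, grouped by TCP state.

    `lines` is any iterable of lines in /proc/net/tcp format.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip the header row and anything that does not look like a socket row.
        if len(fields) < 4 or ":" not in fields[1]:
            continue
        try:
            # local_address is HEXIP:HEXPORT
            port = int(fields[1].rsplit(":", 1)[1], 16)
        except ValueError:
            continue
        if port < limit:
            state = fields[3].upper()
            counts[TCP_STATES.get(state, state)] += 1
    return counts


def report(path="/proc/net/tcp"):
    """Print the per-state counts for the local machine (Linux only)."""
    with open(path) as f:
        for state, n in count_privileged(f).most_common():
            print(f"{state:12s} {n}")
```

Calling report() on the pbs_server host before and after applying the sysctl
changes should show whether the TIME_WAIT count on low ports actually drops.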

chris hunter
yale hpc group

> Message: 4
> Date: Sat, 20 Apr 2013 11:29:14 -0500
> From: Sabuj Pattanayek <sabujp at gmail.com>
> Subject: [torqueusers] mount.nfs starts failing after pbs_server gets
> 	"warmed up", starts using up more than 700 sockets
> To: "torqueusers at supercluster.org" <torqueusers at supercluster.org>
> Message-ID:
> 	<CAEeMGHuw_pwjqKAYVZ91fLAdav6WrdJ93=rPEVu_gWyrXefANA at mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hi,
>
> Has anyone seen a problem where mount.nfs starts failing with:
>
> mount.nfs: mount(2): Input/output error
> mount.nfs: mount system call failed
> rc = 32 (return code)
>
> when pbs_server starts making lots of connections? I'm fairly certain I've
> tracked the problem down to pbs_server and not any other process because
> mount.nfs will reliably start working again after pbs_server is killed. We
> only have 36 nodes, the system running pbs_server is a KVM virtualized
> system running with 6 virtual procs, 9GB of RAM, system load is near 0:
>
> # uptime
>  11:27:27 up  9:10,  9 users,  load average: 0.11, 0.16, 0.30
>
> memory usage is negligible (free -m) :
>
>              total       used       free     shared    buffers     cached
> Mem:          8880       1401       7479          0        223        624
> -/+ buffers/cache:        553       8327
> Swap:         5119          0       5119
>
> I've tried renicing pbs_server to 20, and ionicing it to class 3 (idle) to
> no avail. Anyone have any other ideas?
>
> Thanks,
> Sabuj



