[torqueusers] Re: mount.nfs starts failing after pbs_server gets "warmed up", starts using up more than 700 sockets (Sabuj Pattanayek)
Mark.Henshall at cancer.org.uk
Sat Apr 20 13:36:54 MDT 2013
I've been having the same issue - I'd set tcp_fin_timeout to 30 and
set tcp_tw_reuse (but not tcp_tw_recycle) - and I was still having problems.
I've now lowered tcp_fin_timeout to 20 seconds and set tcp_tw_recycle,
but I reckon I'm going to have to do a recompile. Will the reconfigure
mean I have to push the pbs_mom binary out to the nodes (I can see that
being an issue), or is it just the pbs_server binary that will need
replacing and restarting? If it's the latter, then great.
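For reference, the sysctl settings being discussed can be collected in a drop-in fragment like the sketch below (the filename is illustrative). One caution worth adding: net.ipv4.tcp_tw_recycle is known to break clients behind NAT and was removed entirely in Linux 4.12, so it only applies to older kernels and should be used carefully.

```
# /etc/sysctl.d/90-torque.conf -- illustrative filename
# Release sockets faster because pbs_server uses a lot of them
net.ipv4.tcp_fin_timeout = 20
# Reuse TIME_WAIT sockets for new outgoing connections
net.ipv4.tcp_tw_reuse = 1
# Aggressive TIME_WAIT recycling; breaks NATed clients and was
# removed in Linux 4.12 -- older kernels only, use with care
net.ipv4.tcp_tw_recycle = 1
```

Apply without a reboot via `sysctl -p /etc/sysctl.d/90-torque.conf`.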
Quoting Sabuj Pattanayek <sabujp at gmail.com>:
> Thanks, I've been pulling my hair out over this for quite some time now!
> Recompiled with --disable-privports and looks like that did the trick.
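For anyone following along, a sketch of the rebuild Sabuj describes; --disable-privports is the configure flag named above, but the source directory and install prefix here are assumptions you'd adjust to your site:

```shell
# Sketch only: rebuild TORQUE without privileged source ports.
# Source path and --prefix are illustrative, not from the thread.
cd /usr/local/src/torque
./configure --disable-privports --prefix=/usr/local
make
sudo make install
```

Since privports only affect how pbs_server opens connections, this is consistent with only the server binary needing replacement, but checking your build's docs before pushing anything to the nodes would be prudent.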
> On Sat, Apr 20, 2013 at 11:36 AM, Chris Hunter <chris.hunter at yale.edu> wrote:
>> This is a known problem. The solution is courtesy of Brock Palen, who hosts
>> the RCE podcasts.
>> # release sockets faster because we use a lot of them
>>> net.ipv4.tcp_fin_timeout = 20
>>> # Reuse sockets as fast as possible
>>> net.ipv4.tcp_tw_reuse = 1
>>> net.ipv4.tcp_tw_recycle = 1
>>> You can also build torque to not use priv ports.
>>> Lastly, you can increase job_stat_rate,
>> chris hunter
>> yale hpc group
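On the last suggestion: job_stat_rate is a pbs_server attribute controlling how often the server polls the MOMs for job status, so raising it means fewer short-lived connections. A sketch of changing it with qmgr, where the value 300 is purely illustrative and not a recommendation from this thread:

```shell
# Illustrative only: poll MOMs every 300 s instead of the default.
qmgr -c "set server job_stat_rate = 300"
# Confirm the new value
qmgr -c "print server" | grep job_stat_rate
```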
>>> Date: Sat, 20 Apr 2013 11:29:14 -0500
>>> From: Sabuj Pattanayek <sabujp at gmail.com>
>>> Subject: [torqueusers] mount.nfs starts failing after pbs_server gets
>>> "warmed up", starts using up more than 700 sockets
>>> To: "torqueusers at supercluster.org" <torqueusers at supercluster.org>
>>> Anyone seen a problem where mount.nfs will start failing with:
>>> mount.nfs: mount(2): Input/output error
>>> mount.nfs: mount system call failed
>>> rc = 32 (return code)
>>> when pbs_server starts making lots of connections? I'm fairly certain I've
>>> tracked the problem down to pbs_server and not any other process because
>>> mount.nfs will reliably start working again after pbs_server is killed. We
>>> only have 36 nodes, the system running pbs_server is a KVM virtualized
>>> system running with 6 virtual procs, 9GB of RAM, system load is near 0:
>>> # uptime
>>> 11:27:27 up 9:10, 9 users, load average: 0.11, 0.16, 0.30
>>> memory usage is negligible (free -m) :
>>>              total       used       free     shared    buffers     cached
>>> Mem:          8880       1401       7479          0        223        624
>>> -/+ buffers/cache:        553       8327
>>> Swap:         5119          0       5119
>>> I've tried renicing pbs_server to 20, and ionicing it to class 3 (idle) to
>>> no avail. Anyone have any other ideas?
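One way to confirm the socket-exhaustion diagnosis (not from the thread, just a suggested check): count how many TCP sockets are sitting in TIME_WAIT while pbs_server is busy. A count in the hundreds would match the ~700-socket symptom described above.

```shell
# Count TCP sockets in TIME_WAIT (state code 06 in /proc/net/tcp).
# Linux-only; /proc/net/tcp6 may be absent, hence the stderr redirect.
awk 'NR > 1 && $4 == "06"' /proc/net/tcp /proc/net/tcp6 2>/dev/null | wc -l
# Per-process view of sockets held by pbs_server (needs lsof):
#   pgrep -x pbs_server | xargs -r -I{} lsof -nP -a -p {} -i | wc -l
```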
Cancer Research UK London Research Institute
Lincoln's Inn Fields Laboratories
44 Lincoln's Inn Fields
London WC2A 3LY
Registered charity number 1089464
t: 0207 269 3602
f: 0207 061 8011
e: Mark.Henshall at cancer.org.uk