[torqueusers] mount.nfs starts failing after pbs_server gets "warmed up", starts using up more than 700 sockets

Sabuj Pattanayek sabujp at gmail.com
Sat Apr 20 10:29:14 MDT 2013


Has anyone seen a problem where mount.nfs starts failing with:

mount.nfs: mount(2): Input/output error
mount.nfs: mount system call failed
rc = 32 (return code)

when pbs_server starts making lots of connections (more than 700 sockets
open)? I'm fairly certain I've tracked the problem down to pbs_server rather
than any other process, because mount.nfs reliably starts working again
after pbs_server is killed. We only have 36 nodes. The system running
pbs_server is a KVM virtual machine with 6 virtual CPUs and 9GB of RAM, and
system load is near 0:

# uptime
 11:27:27 up  9:10,  9 users,  load average: 0.11, 0.16, 0.30

Memory usage is also negligible (free -m):

             total       used       free     shared    buffers     cached
Mem:          8880       1401       7479          0        223        624
-/+ buffers/cache:        553       8327
Swap:         5119          0       5119

I've tried renicing pbs_server to 20 and using ionice to put it in class 3
(idle), to no avail. Does anyone have any other ideas?
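
For anyone who wants to watch the socket count on their own server, one
option is to count the socket descriptors listed under /proc/<pid>/fd. The
following is only a minimal sketch, assuming a Linux /proc filesystem; the
count_sockets function and its proc_root parameter are hypothetical names
made up for illustration and are not part of TORQUE:

```python
import os
import sys


def count_sockets(pid, proc_root="/proc"):
    """Count the open file descriptors of `pid` that are sockets.

    Assumes the Linux /proc layout, where each entry in /proc/<pid>/fd is a
    symlink and socket descriptors point at a target like "socket:[12345]".
    `proc_root` is a hypothetical parameter so the function can be pointed at
    a different directory when testing.
    """
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    count = 0
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        if target.startswith("socket:["):
            count += 1
    return count


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python count_sockets.py <pid of pbs_server>
    print(count_sockets(int(sys.argv[1])))
```

Reading another user's /proc/<pid>/fd normally requires running as root or
as the user that owns the process. Running this periodically against the
pbs_server PID would show whether the mount.nfs failures line up with the
socket count growing.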
