[torqueusers] Interaction with NFS caused by high job count
Brock Palen
brockp at umich.edu
Thu Mar 17 11:15:02 MDT 2011
Yes,
Check the number of privileged ports in use.
You can solve this a few ways. We tune TCP settings to avoid it:
#http://www.clusterresources.com/pipermail/torqueusers/2009-February/008715.html
# release sockets faster because we use a lot of them
net.ipv4.tcp_fin_timeout = 20
# Reuse sockets as fast as possible
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
You can also build torque to not use privileged ports.
Lastly, you can increase job_stat_rate.
Note that the number of connections is proportional to the number of jobs, not nodes.
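To check privileged port usage on the pbs_server host, here is a minimal sketch. It assumes a Linux machine with the standard /proc/net/tcp layout, covers IPv4 only (/proc/net/tcp6 is not read), and treats ports below 1024 as privileged. The function name count_privileged_ports is illustrative and not part of torque.

```python
from collections import Counter

# Assumption: "privileged" means local ports below 1024.
PRIVILEGED_MAX = 1023

# Subset of the kernel's TCP state codes as they appear in /proc/net/tcp.
TCP_STATES = {"01": "ESTABLISHED", "06": "TIME_WAIT", "0A": "LISTEN"}


def count_privileged_ports(proc_net_tcp_text):
    """Count sockets with a privileged local port, grouped by TCP state.

    proc_net_tcp_text is the full text of /proc/net/tcp.
    States not listed in TCP_STATES are reported by their raw hex code.
    """
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # local_address looks like "0100007F:03FF" (hex IP, hex port)
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port <= PRIVILEGED_MAX:
            counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts
```

On the server this could be called as count_privileged_ports(open("/proc/net/tcp").read()). A large TIME_WAIT count among the privileged ports would suggest that few are left free for new connections.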
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985
On Mar 17, 2011, at 12:51 PM, Kevin Van Workum wrote:
> Has anybody ever noticed any problems with mounting NFS shares on the machine running pbs_server?
>
> We've seen some issues when the pbs server machine tries to mount NFS shares if we have a large number of running jobs (700-1000 jobs). The error is:
>
> mount.nfs: input/output error
>
> The error is inconsistent. Sometimes it works, other times not. I'm guessing I have too many TCP connections open, but it seems like 1000 jobs shouldn't cause a problem. Any ideas?
>
> --
> Kevin Van Workum, PhD
> Sabalcore Computing Inc.
> Run your code on 500 processors.
> Sign up for a free trial account.
> www.sabalcore.com
> 877-492-8027 ext. 11
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers