[torqueusers] Communication between pbs_mom-s kills jobs
gadre at wisc.edu
Sun Aug 29 09:05:48 MDT 2010
We use PBS 2.3 on our cluster. For the past few days we have been seeing heavy load on NFS and, at the same time, jobs being killed when the pbs_moms fail to connect or talk to each other. The error is:
=>> PBS: job killed: node 1 (compute-0-16) requested job terminate, 'EOF' (code 1099) - received SISTER_EOF attempting to communicate with sister MOM's
mpiexec: Warning: task 0 exited with status 1.
mpiexec: Warning: task 8 died with signal 501235728 (Unknown signal 501235728).
mpiexec: Warning: task 10 exited with status -510318024.
mpiexec: Warning: task 11 exited with status 61.
mpiexec: Warning: task 14 exited with status 33.
From the user forum I gather that this happens when NFS is heavily overloaded, which is the case here. What I do not know is how to solve it.
Can someone please help me solve this problem? I am a new administrator and would appreciate any guidance and help!