[torqueusers] Communication between pbs_mom-s kills jobs

Axel Kohlmeyer akohlmey at cmm.chem.upenn.edu
Sun Aug 29 12:26:51 MDT 2010


On Sun, Aug 29, 2010 at 11:05 AM, Milind <gadre at wisc.edu> wrote:
> Hello all,
>
> We use PBS 2.3 on our cluster. For the past few days we have been experiencing heavy load on NFS, and at the same time jobs are being killed when the pbs_mom daemons fail to connect or talk to each other. The error is:
>
> =>> PBS: job killed: node 1 (compute-0-16) requested job terminate, 'EOF' (code 1099) - received SISTER_EOF attempting to communicate with sister MOM's
> mpiexec: Warning: task 0 exited with status 1.
> mpiexec: Warning: task 8 died with signal 501235728 (Unknown signal 501235728).
> mpiexec: Warning: task 10 exited with status -510318024.
> mpiexec: Warning: task 11 exited with status 61.
> mpiexec: Warning: task 14 exited with status 33.
>
>
> From the user forum I gather that this happens when NFS is heavily overloaded, which is the case here. What I do not know is how to solve it.

find out what is overloading the NFS and stop that.
in my experience, more often than not this happens
when (some) user(s) is/are doing something very,
very stupid.

an overloaded NFS inconveniences all, so addressing
this from the side of PBS is pretty much useless.
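
as a rough starting point for "find out what is overloading the NFS", the sketch below takes two snapshots of /proc/<pid>/io on one node and ranks processes by how many bytes they moved in between. it assumes a Linux kernel that provides /proc/<pid>/io; the rchar/wchar counters cover all read/write system calls, not only NFS traffic, so the output is a hint and not proof. the function names and the 10 second interval are made up for illustration, and reading other users' counters normally requires root.

```python
import os
import time


def parse_proc_io(text):
    """Parse the contents of /proc/<pid>/io into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            counters[key.strip()] = int(value)
    return counters


def io_volume(counters):
    # rchar/wchar count bytes passed through read/write system calls on any
    # filesystem, so this is a superset of the NFS traffic (assumption: good
    # enough to spot the heaviest process on a compute node).
    return counters.get("rchar", 0) + counters.get("wchar", 0)


def rank_by_io(before, after, top=10):
    """Return [(pid, bytes_moved)] sorted by descending I/O between snapshots."""
    deltas = []
    for pid, new in after.items():
        old = before.get(pid)
        if old is None:
            continue  # process started between the two samples; skip it
        deltas.append((pid, io_volume(new) - io_volume(old)))
    deltas.sort(key=lambda item: item[1], reverse=True)
    return deltas[:top]


def snapshot():
    """Read /proc/<pid>/io for every process we are allowed to inspect."""
    result = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join("/proc", entry, "io")) as handle:
                result[int(entry)] = parse_proc_io(handle.read())
        except OSError:
            pass  # process exited in the meantime, or permission denied
    return result


if __name__ == "__main__":
    first = snapshot()
    time.sleep(10)  # arbitrary sampling interval
    for pid, moved in rank_by_io(first, snapshot()):
        print(pid, moved)
```

run it on a few of the affected compute nodes while the load is high; the pids at the top can then be matched to users and jobs with ps.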

>
> Can someone please help me with how to solve the problem? I am a new administrator and I would appreciate all the guidance and help!

find a more experienced local administrator
and discuss your problems.

cheers,
    axel.

> thanks!!
> --Milind



-- 
Dr. Axel Kohlmeyer    akohlmey at gmail.com
http://sites.google.com/site/akohlmey/

Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.

