[torqueusers] Communication between pbs_mom-s kills jobs

Steve Crusan scrusan at ur.rochester.edu
Sun Aug 29 18:52:16 MDT 2010




On 8/29/10 2:26 PM, "Axel Kohlmeyer" <akohlmey at cmm.chem.upenn.edu> wrote:

> On Sun, Aug 29, 2010 at 11:05 AM, Milind <gadre at wisc.edu> wrote:
>> Hello all,
>> 
>> We use PBS 2.3 on our cluster. For the past few days we have been seeing a
>> heavy load on NFS and, at the same time, jobs ending when the pbs_moms fail
>> to connect/talk to each other. The error is:
>> 
>> =>> PBS: job killed: node 1 (compute-0-16) requested job terminate, 'EOF'
>> (code 1099) - received SISTER_EOF attempting to communicate with sister MOM's
>> mpiexec: Warning: task 0 exited with status 1.
>> mpiexec: Warning: task 8 died with signal 501235728 (Unknown signal
>> 501235728).
>> mpiexec: Warning: task 10 exited with status -510318024.
>> mpiexec: Warning: task 11 exited with status 61.
>> mpiexec: Warning: task 14 exited with status 33.
>> 
>> 
>> I gather from the user forum that this happens when NFS is badly swamped,
>> which is true here. What I do not know is how to solve this.
> 
> find out what is overloading the NFS and stop that.
> in my experience, more often than not this happens,
> when (some) user(s) is/are doing something very,
> very stupid.
> 
> an overloaded NFS inconveniences all, so addressing
> this from the side of PBS is pretty much useless.
> 

I would suggest that your users write any node-local data (whatever the
tasks under each PBS mom need) to a /local_scratch disk on the node, and
copy it to their NFS directory afterwards, at the end of the job,
checkpoint, or iteration in the code. The premise is: wherever possible,
read and write in local temporary space, and copy results to the
cluster-wide network filesystem only at the end of the job.
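
A rough sketch of that pattern in Python, in case it helps to show users.
The function name and its arguments are made up for illustration, and it
assumes the node-local scratch directory (e.g. /local_scratch) already
exists and is writable. It needs Python 3.8 or newer.

```python
import shutil
import tempfile
from pathlib import Path


def run_with_local_scratch(work, scratch_root, shared_dir):
    """Run work(scratch_dir) on node-local disk, then copy results out once.

    work         -- callable that does all of its I/O inside the directory
                    it is given (hypothetical, supplied by the user)
    scratch_root -- existing node-local directory, e.g. /local_scratch
    shared_dir   -- destination on the shared (NFS) filesystem
    """
    shared = Path(shared_dir)
    shared.mkdir(parents=True, exist_ok=True)
    # Private per-job directory on the local disk.
    scratch = Path(tempfile.mkdtemp(prefix="job_", dir=scratch_root))
    try:
        work(scratch)  # intermediate reads/writes never touch NFS
        copied = []
        # Single stage-out pass to the shared filesystem at the end.
        for item in sorted(scratch.iterdir()):
            target = shared / item.name
            if item.is_dir():
                shutil.copytree(item, target, dirs_exist_ok=True)
            else:
                shutil.copy2(item, target)
            copied.append(target)
        return copied
    finally:
        # Always clean up the local scratch space, even if work() fails.
        shutil.rmtree(scratch, ignore_errors=True)
```

The same idea works just as well as a few lines of cp at the end of the
job script; the point is only that NFS sees one burst of writes per job
instead of a steady stream.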

If certain jobs are I/O-heavy, their constant reads and writes against the
NFS server can degrade performance for the whole cluster.

I guess you could also check how many threads the NFS server is running...
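
One way to check, assuming a Linux kernel NFS server: /proc/net/rpc/nfsd
has a line starting with "th" whose first number is the count of nfsd
threads. On older kernels the second number counts how often all threads
were busy at once; newer kernels may always report 0 there. The helper
below is only a sketch and its name is made up.

```python
from pathlib import Path


def nfsd_thread_stats(text):
    """Return (threads, times_all_busy) from /proc/net/rpc/nfsd contents.

    Returns None if no "th" line is present.
    """
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "th":
            return int(fields[1]), int(fields[2])
    return None


if __name__ == "__main__":
    stats_file = Path("/proc/net/rpc/nfsd")  # present only on an NFS server
    if stats_file.exists():
        print(nfsd_thread_stats(stats_file.read_text()))
```

If the second number keeps climbing on your server, the thread count may
be too low for the load.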

NFS performance can be a pain, but more often than not, it's a user/service
pushing the current setup to its limits. Unfortunately that means it's time
to:

- set up a better network configuration (logically, physically, etc.) for NFS
- educate users better (as I stated above)
- investigate a more suitable cluster filesystem
- beat down users
(remember, this is in no particular order :-) )





>> 
>> Can someone please help me with how to solve the problem? I am a new
>> administrator and I would appreciate all the guidance and help!
> 
> find a more experienced local administrator
> and discuss your problems.
> 
> cheers,
>     axel.
> 
>> thanks!!
>> --Milind
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>> 
> 
> 



----------------------
Steve Crusan
System Administrator
Center for Research Computing
University of Rochester
https://www.crc.rochester.edu/


