[torqueusers] Slave Node Issues - 10 Node Brand new Cluster - Jobs not completing

Joseph De Nicolo j.denicolo at neu.edu
Thu Feb 6 16:33:55 MST 2014


There is no packet loss or other network problem. We have run tests with very
large files (not using Torque or NFS), and within the network the transfer
speeds were as expected and all operations completed without data loss or any
other issues.


*Joseph De Nicolo*
*Systems & Data Administrator*
*Center for Complex Network Research <http://www.barabasilab.com>*


*Northeastern University*


On Thu, Feb 6, 2014 at 6:26 PM, Moye,Roger V <RVMoye at mdanderson.org> wrote:

>  Have you been able to determine if there is packet loss?  If no packet
> loss then the network is possibly not the problem, unless of course the
> NICs happened to negotiate to a slow speed.
>
>
>
> ethtool will show you the current speed that your NICs are using.
>
> You might also try iperf (you will have to download and compile it) to
> test your network bandwidth between your nodes and the server to make sure
> it is what you think it should be.
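>
> For example (eth0 and the host names are just placeholders for your actual
> interface and nodes):
>
> ethtool eth0              # shows Speed, Duplex and Auto-negotiation
> iperf -s                  # run this on the server
> iperf -c <server-name>    # run this on a compute node; reports bandwidth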
>
>
>
> -Roger
>
>
>
> -----------------------------------------------------------
>
> Roger V. Moye
>
> Systems Analyst III
>
> XSEDE Campus Champion
>
> University of Texas - MD Anderson Cancer Center
>
> Division of Quantitative Sciences
>
> Pickens Academic Tower - FCT4.6109
>
> Houston, Texas
>
> (713) 792-2134
>
> -----------------------------------------------------------
>
>
>
> *From:* torqueusers-bounces at supercluster.org [mailto:
> torqueusers-bounces at supercluster.org] *On Behalf Of *Joseph De Nicolo
> *Sent:* Thursday, February 06, 2014 10:59 AM
> *To:* Torque Users Mailing List
>
> *Subject:* Re: [torqueusers] Slave Node Issues - 10 Node Brand new
> Cluster - Jobs not completing
>
>
>
> All of the NICs on each node are set to 10/100/1000 and full duplex.
> However, could "Auto-negotiation = on" cause the issue? Does anybody have
> experience changing the default settings on a switch for a cluster? Does
> auto-negotiation add a lot of network overhead? I'm guessing I should just
> manually set the speed of each NIC and switch port.
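>
> I assume something along the lines of
>
> ethtool -s eth0 speed 1000 duplex full autoneg off
>
> on each node (with eth0 replaced by the actual interface name, and the
> matching switch port configured the same way) would be the way to do that?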
>
>
>
>
> *Joseph De Nicolo*
> *Systems & Data Administrator*
>
> *Center for Complex Network Research <http://www.barabasilab.com>*
>
>
> *Northeastern University *
>
>
>
> On Wed, Feb 5, 2014 at 5:45 PM, Moye,Roger V <RVMoye at mdanderson.org>
> wrote:
>
> While the cluster is idle, try pinging every compute node from the server
> node and check for packet loss. Assuming you find none, re-run your test
> scenario and then ping again. See if any nodes show packet loss.
> If so, you have a network problem. One common thing, though I confess
> I've not seen this lately, would be that a NIC on one or more of the nodes
> (or perhaps the server) has negotiated to the wrong speed, such as the NIC
> being at half-duplex while the switch is at full-duplex. This will wreck
> things. You might not see the problem when the network is idle, but you
> will definitely notice when the network is busy.
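>
> For the ping test, something like this from the server will do it (node
> names here are just placeholders):
>
> for n in node01 node02 node03; do ping -c 100 -q $n; done
>
> Check the "packet loss" figure in each summary.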
>
>
>
> -Roger
>
>
>
>
>
> -----------------------------------------------------------
>
> Roger V. Moye
>
> Systems Analyst III
>
> XSEDE Campus Champion
>
> University of Texas - MD Anderson Cancer Center
>
> Division of Quantitative Sciences
>
> Pickens Academic Tower - FCT4.6109
>
> Houston, Texas
>
> (713) 792-2134
>
> -----------------------------------------------------------
>
>
>
> *From:* torqueusers-bounces at supercluster.org [mailto:
> torqueusers-bounces at supercluster.org] *On Behalf Of *Joseph De Nicolo
> *Sent:* Monday, February 03, 2014 4:35 PM
> *To:* rf at q-leap.de; Torque Users Mailing List
> *Subject:* Re: [torqueusers] Slave Node Issues - 10 Node Brand new
> Cluster - Jobs not completing
>
>
>
> After adjusting the NFSD count to 64, we are still having some
> communication issues, but they now seem more specific, so hopefully somebody
> can give some insight toward a more exact solution. Here are a couple of
> job-submission scenarios and the current issues at hand:
>
> During both of these tests, no other jobs or processes were running on any
> node; all nodes were free.
>
>
>
> It seems the NFS overhead is far too high, even in this simple scenario:
>
> 1. A single job process was spawned on the Torque server (where the network
> file system is mounted locally); it simply wrote to a file and completed in
> under 10 seconds.
>
> 2. The same single job process was spawned on a Torque MOM node (network
> file system mounted via NFSv4) and took 10 minutes.
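>
> To take Torque out of the picture entirely, I am planning to repeat this as
> a plain write from a MOM node onto the NFS mount, something like (the path
> is just an example):
>
> dd if=/dev/zero of=/mnt/shared/ddtest bs=1M count=100 conv=fsync
>
> and compare the throughput with the same command run on the head node.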
>
>
>
> Another scenario shows possible collision or stalling of jobs when no
> priorities are set:
>
> 1. 10 jobs were spawned on a Torque MOM (child node 1); they all ran
> concurrently and completed in a reasonable time.
>
> 2. While the original 10 jobs were running again, we spawned 10 more on a
> different Torque MOM (child node 2).
>
> 3. The jobs were all in the same queue and run by the same person, so there
> was no priority factor.
>
> 4. When the 10 new jobs on child node 2 were spawned, the run times of the
> jobs on child node 1 were affected and nothing completed. The jobs were
> bouncing back and forth between "R" and "D" status in ps aux, with a lot of
> I/O wait, even with only 20 jobs running on a 10-node cluster and 64 nfsd
> daemons.
>
> Could this be a general network issue? Does NFSv4 have to be configured in
> more depth, for example to allow bigger block sizes (see the example mount
> entry below)? Any help or ideas on the matter would be greatly appreciated.
> Thanks all!
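>
> For reference, by bigger block sizes I mean explicitly setting rsize/wsize
> on the mount, along these lines (server, export and mount point are just
> placeholders):
>
> mobs-head:/export  /mnt/shared  nfs4  rsize=1048576,wsize=1048576,hard  0 0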
>
>
>
>
> *Joseph De Nicolo*
> *Systems & Data Administrator*
>
> *Center for Complex Network Research <http://www.barabasilab.com>*
>
>
> *Northeastern University*
>
>
>
> On Thu, Jan 30, 2014 at 12:58 PM, <rf at q-leap.de> wrote:
>
> >>>>> "Joseph" == Joseph De Nicolo <j.denicolo at neu.edu> writes:
>
> Hi Joseph,
>
>     Joseph> Thank you everybody for all the tips. After some analysis I
>     Joseph> think the root of the problem is with NFS. Using iostat I can
>     Joseph> see an average I/O wait of around 10%. We ran a test job on
>     Joseph> the head node, where the storage is directly attached, and
>     Joseph> the job had "running" status and completed in an appropriate
>     Joseph> amount of time. Running the same job on a child node resulted
>     Joseph> in the job being flagged as "D" - uninterruptible sleep. Note
>     Joseph> that there were other jobs running on the cluster using up
>     Joseph> I/O at the time. The job only wrote 17 MB; on the head node it
>     Joseph> took 30 seconds, while the child node was still showing "D"
>     Joseph> status in "ps" after 25 minutes.
>
>     Joseph> This is the first cluster I have ever built. After reading up
>     Joseph> on NFS, I realized a default NFS server only spawns 8 nfsd
>     Joseph> processes to handle I/O requests and that you should raise
>     Joseph> this number. Do you think this is the root of the problem?
>     Joseph> Can anybody advise me on how to raise the number of nfsd
>     Joseph> processes for an NFSv4 server on Ubuntu 12.04? Also, what is
>     Joseph> a good number for a cluster with 10 nodes and 132 cores?
>
> Edit /etc/default/nfs-kernel-server and adjust
>
> RPCNFSDCOUNT
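>
> e.g.
>
> RPCNFSDCOUNT=32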
>
> Afterwards:
>
> $ /etc/init.d/nfs-kernel-server restart
>
> I'd say try 32 threads to start with, and increase in steps of 32 until
> things look better. Of course, it might also turn out that NFS is not up to
> the job at all, which would mean thinking about Lustre etc. It really
> depends on what your applications do.
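>
> To see whether the threads are actually the bottleneck, you can check the
> "th" line on the server while jobs are running:
>
> $ grep th /proc/net/rpc/nfsd
>
> The second number on that line counts how often all threads have been in
> use at once; if it keeps climbing, more threads are needed.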
>
> As an Ubuntu fan, you might find Qlustar of interest.
>
> Best,
>
> Roland
>
> ----
> Roland Fehrenbacher, PhD
> Founder/CEO
> Q-Leap Networks GmbH
> Tel. : +49(0)7034/277620
> EMail: rf at q-leap.com
> http://www.q-leap.com / http://qlustar.com
>
>     Joseph> *Joseph De Nicolo* *Systems & Data Administrator* *Center
>     Joseph> for Complex Network Research <http://www.barabasilab.com>*
>
>
>     Joseph> *Northeastern University*
>
>
>     Joseph> On Wed, Jan 29, 2014 at 2:43 PM, Michael Jennings
>
>     Joseph> <mej at lbl.gov> wrote:
>
>     >> On Tue, Jan 28, 2014 at 9:30 AM, Joseph De Nicolo
>     >> <j.denicolo at neu.edu> wrote:
>     >>
>     >> > Thanks for all the tips on how to get to the bottom of the
>     >> > issues. Here is just a trivial test and one of the errors we are
>     >> > receiving with the cluster:
>     >> >
>     >> > echo "abc" > abc.txt | qsub -l
>     >> > nodes=mobs-child04,walltime=24:00:00
>     >> >
>     >> > The file abc.txt was correctly written;
>     >>
>     >> Yes, because your command created it.  echo "abc" > abc.txt
>     >> creates abc.txt immediately.  The fact that the file exists has
>     >> nothing to do with TORQUE or whether or not your job ran.
>     >>
>     >> What you probably intended was:
>     >>
>     >> echo 'echo "abc" > abc.txt' | qsub -l
>     >> nodes=mobs-child04,walltime=24:00:00
>     >>
>     >> > This was reported by one of my users. I just ran the same exact
>     >> > test on a different node:
>     >> >
>     >> > echo "xyz" > test.txt | qsub -l
>     >> > nodes=mobs-child01,walltime=24:00:00
>     >> >
>     >> > The file test.txt was correctly written with contents "xyz", but
>     >> > my job is still listed in "qstat" with a "Q" state as if it never
>     >> > ran. I did not receive any STDIN output file either.
>     >>
>     >> Again, this is because the file creation is happening when you
>     >> run the command.  Nothing to do with the queuing system.
>     >>
>     >> Unfortunately, that also means nothing here is relevant to
>     >> addressing whatever problems you're having with TORQUE.  But from
>     >> the sounds of it, your jobs are overloading the nodes.  Try a
>     >> sleep command instead (for, say, 86400 seconds).
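>     >>
>     >> For example:
>     >>
>     >> echo 'sleep 86400' | qsub -l nodes=mobs-child01,walltime=24:00:00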
>     >>
>     >> Michael
>     >>
>     >> --
>     >> Michael Jennings <mej at lbl.gov>
>     >> Senior HPC Systems Engineer
>     >> High-Performance Computing Services
>     >> Lawrence Berkeley National Laboratory
>     >> Bldg 50B-3209E   W: 510-495-2687
>     >> MS 050B-3209     F: 510-486-8615
>     >>
>
>
> --
> ----
> Dr. Roland Fehrenbacher
> Geschäftsführer
>
> Q-Leap Networks GmbH
> Königstrasse 17/3
> D-71139 Ehningen
> Tel. : +49(0)7034/277620
> Fax  : +49(0)7034/652836
> EMail: rf at q-leap.de
> http://www.q-leap.de
>
> Handelsregister Amtsgericht Stuttgart HRB 245373
> St.-Nr. 56/464/05060
> USt-IdNr. DE220607026
> Geschäftsführer:  Dr. Roland Fehrenbacher
>
>
>
>
>
>
>
>
>
>

