[torqueusers] Slave Node Issues - 10 Node Brand new Cluster - Jobs not completing

Moye,Roger V RVMoye at mdanderson.org
Wed Feb 5 15:45:07 MST 2014

While the cluster is idle, try pinging every compute node from the server node and check for packet loss. Assuming you find none, re-run your test scenario and then ping again. If any nodes show packet loss, you have a network problem. One common cause, though I confess I've not seen it lately, is that a NIC on one or more of the nodes (or perhaps the server) has negotiated the wrong link settings, such as the NIC running at half-duplex while the switch is at full-duplex. This will wreck things. You might not see the problem when the network is idle, but you will definitely notice it when the network is busy.
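
The check described above can be scripted. What follows is only a minimal sketch: the helper names `loss_percent`, `check_nodes`, and `link_settings` are made up for illustration, the interface name `eth0` is an assumption, and it presumes the summary format printed by the Linux iputils `ping` and that `ethtool` is installed.

```shell
#!/bin/sh
# Pull the packet-loss percentage out of the summary line printed by ping.
loss_percent() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Ping each node given as an argument and report its packet loss.
check_nodes() {
    for node in "$@"; do
        loss=$(ping -c 20 -q "$node" 2>/dev/null | loss_percent)
        if [ -n "$loss" ]; then
            echo "$node: ${loss}% packet loss"
        else
            echo "$node: no ping summary (host unknown or ping failed)"
        fi
    done
}

# Show the negotiated speed and duplex of an interface.
# eth0 is only an assumed default; pass your own interface name.
link_settings() {
    ethtool "${1:-eth0}" | grep -E 'Speed|Duplex'
}
```

Run something like `check_nodes mobs-child01 mobs-child04` from the server node once while the cluster is idle and once under load, and run `link_settings` on each node (it may need root) to spot a link that came up at half-duplex.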


Roger V. Moye
Systems Analyst III
XSEDE Campus Champion
University of Texas - MD Anderson Cancer Center
Division of Quantitative Sciences
Pickens Academic Tower - FCT4.6109
Houston, Texas
(713) 792-2134

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Joseph De Nicolo
Sent: Monday, February 03, 2014 4:35 PM
To: rf at q-leap.de; Torque Users Mailing List
Subject: Re: [torqueusers] Slave Node Issues - 10 Node Brand new Cluster - Jobs not completing

After adjusting the NFSD count to 64, we are still having some communication issues but they seem to be more specific now so hopefully somebody can give some insight on a more exact solution. Here are a couple of scenarios of job submissions and the current issues at hand:
During both of these tests, no other jobs or processes were running on any node as they were all free.

It seems the NFS overhead is way too high even in this simple scenario:
1. A single job process was spawned on the torque server (network file system mounted locally), just simple writing to a file which completed in under 10 seconds.
2. The same single job process was spawned on a torque MOM node (network file system mounted via NFSv4), took 10 minutes.
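
A comparison like the one above can be repeated outside of TORQUE with a plain timed write. This is a rough sketch under stated assumptions: `timed_write` is a hypothetical helper, the default size of 17 MB is taken from the "17Mb" figure quoted later in the thread, and GNU `dd` is assumed (for `bs=1M` and `conv=fsync`).

```shell
#!/bin/sh
# Write size_mb megabytes of zeros into target_dir, force them to storage,
# and report how many whole seconds the write took.
timed_write() {
    target_dir=$1
    size_mb=${2:-17}
    test_file="$target_dir/nfs_write_test.$$"
    start=$(date +%s)
    # conv=fsync makes dd flush the data before exiting, so the time
    # includes the trip to the NFS server rather than just the page cache.
    dd if=/dev/zero of="$test_file" bs=1M count="$size_mb" conv=fsync 2>/dev/null || return 1
    end=$(date +%s)
    rm -f "$test_file"
    echo "$target_dir: wrote ${size_mb} MB in $((end - start)) s"
}
```

Calling `timed_write /path/to/shared/dir` on the server and again on a MOM node, with the path replaced by your own NFS-exported directory, gives two numbers that can be compared directly without the batch system in the way.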

Another scenario is showing possible collision or halting of other jobs when there are no priorities set:
1. 10 jobs spawned on torque mom (child node 1) - they were all running concurrently and completed in a reasonable time.
2. While the original 10 jobs were running again, this time we spawned 10 more on a different torque mom (child node 2).
3. The jobs were all in the same queue, run by the same person, so no priority factor.
4. When the 10 new jobs on child 2 were spawned, it affected the run times of the jobs on child 01 and nothing completed. They were bouncing back and forth from "R" status to "D" in ps aux, and there was a lot of IO wait. This happened with only 20 jobs running on a 10-node cluster with 64 nfsd daemons.
Could this be a general network issue? Does NFSv4 have to be configured in more depth, possibly to allow bigger block sizes? Any help or ideas on the matter would be greatly appreciated. Thanks all!
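
Two quick inspections can help narrow down questions like these: which processes are stuck in "D" state, and which transfer sizes the NFS mounts actually negotiated (the `rsize` and `wsize` mount options). The sketch below is an illustration only; the helper names are hypothetical and it assumes a Linux procps `ps` and a readable `/proc/mounts`.

```shell
#!/bin/sh
# Keep the header line plus every row whose state column starts with D.
d_state_filter() {
    awk 'NR == 1 || $1 ~ /^D/'
}

# List processes currently in uninterruptible sleep, with the kernel
# function they are waiting in (wchan).
list_d_state() {
    ps -eo state,pid,user,wchan,args | d_state_filter
}

# Print mount point and options (including rsize/wsize) for NFS mounts.
# The optional argument is a mounts file; it defaults to /proc/mounts.
nfs_mount_options() {
    awk '$3 == "nfs" || $3 == "nfs4" { print $2 ": " $4 }' "${1:-/proc/mounts}"
}
```

Running `list_d_state` and `nfs_mount_options` on a MOM node while the jobs are stalled shows whether the blocked processes are the job processes themselves and what block sizes are in effect before any tuning is attempted.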

Joseph De Nicolo
Systems & Data Administrator
Center for Complex Network Research <http://www.barabasilab.com>
Northeastern University

On Thu, Jan 30, 2014 at 12:58 PM, <rf at q-leap.de> wrote:
>>>>> "Joseph" == Joseph De Nicolo <j.denicolo at neu.edu> writes:

Hi Joseph,

    Joseph> Thank you everybody for all the tips.  After some analysis I
    Joseph> think the root of the problem is with NFS. Using iostat I
    Joseph> can see some I/O wait% of average 10%.  We ran a test job on
    Joseph> the head node where the storage is directly attached, and
    Joseph> the job had "running" status and completed in an appropriate
    Joseph> amount of time.
    Joseph> Running the same job on a child node resulted in the job
    Joseph> being flagged as "D" - uninterruptible sleep. Note there
    Joseph> were other jobs currently running on the cluster using up
    Joseph> I/O. The job only wrote 17Mb but on the head node it took
    Joseph> 30 seconds.. while the child node was still showing "D"
    Joseph> status in "ps" after 25 minutes.

    Joseph> This is the first cluster I ever built. After reading up on
    Joseph> NFS, I realized a default NFS server only spawns 8 nfsd
    Joseph> processes to handle I/O requests and that you should raise
    Joseph> this number. Do you think this is the root of the problem?
    Joseph> Can anybody advise me on how to raise the number of nfsd
    Joseph> spawns for a NFSv4 server on ubuntu 12.04? Also what is a
    Joseph> good number for a cluster that is 10 nodes, 132 cores.

Edit /etc/default/nfs-kernel-server and adjust



$ /etc/init.d/nfs-kernel-server restart

I'd say try 32 threads to start with. Increase in steps of 32 until things
look better. Of course it might also turn out that NFS is not up to the job at
all, which would have to make you think about Lustre etc. Really
depends what your applications do.
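
The setting to adjust did not survive in the archived message above. On Debian and Ubuntu systems the nfsd thread count in /etc/default/nfs-kernel-server is normally controlled by a variable named `RPCNFSDCOUNT`; treat that name as an assumption here and check your own file. The helper `set_nfsd_count` below is hypothetical and only sketches the edit, assuming GNU `sed`.

```shell
#!/bin/sh
# Set the nfsd thread count in an nfs-kernel-server defaults file.
# RPCNFSDCOUNT is the variable name assumed here; verify it in your file.
set_nfsd_count() {
    count=$1
    conf=${2:-/etc/default/nfs-kernel-server}
    # Refuse to continue if the assumed variable is not present.
    grep -q '^RPCNFSDCOUNT=' "$conf" || {
        echo "RPCNFSDCOUNT not found in $conf" >&2
        return 1
    }
    # Rewrite only that one assignment, leaving every other line untouched.
    sed -i "s/^RPCNFSDCOUNT=.*/RPCNFSDCOUNT=$count/" "$conf"
}
```

As root, `set_nfsd_count 32` followed by the restart command above would apply the suggested starting value; `cat /proc/fs/nfsd/threads` on the server then reports how many nfsd threads are running.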

As an Ubuntu fan, you might find Qlustar of interest to you.



Roland Fehrenbacher, PhD
Q-Leap Networks GmbH
Tel. : +49(0)7034/277620
EMail: rf at q-leap.com
http://www.q-leap.com / http://qlustar.com

    Joseph> *Joseph De Nicolo* *Systems & Data Administrator* *Center
    Joseph> for Complex Network Research <http://www.barabasilab.com>*

    Joseph> *Northeastern University*

    Joseph> On Wed, Jan 29, 2014 at 2:43 PM, Michael Jennings
    Joseph> <mej at lbl.gov> wrote:

    >> On Tue, Jan 28, 2014 at 9:30 AM, Joseph De Nicolo
    >> <j.denicolo at neu.edu> wrote:
    >> > Thanks for all the tips on how to get to the bottom of the
    >> > issues. Here is just a trivial test and one of the errors we
    >> > are receiving with the cluster:
    >> >
    >> > echo "abc" > abc.txt | qsub -l nodes=mobs-child04,walltime=24:00:00
    >> >
    >> > The file abc.txt was correctly written;
    >>
    >> Yes, because your command created it.  echo "abc" > abc.txt
    >> creates abc.txt immediately.  The fact that the file exists has
    >> nothing to do with TORQUE or whether or not your job ran.
    >>
    >> What you probably intended was:
    >>
    >> echo 'echo "abc" > abc.txt' | qsub -l nodes=mobs-child04,walltime=24:00:00
    >>
    >> > This was reported by one of my users.. I just ran the same
    >> > exact test on a different node:
    >> >
    >> > echo "xyz" > test.txt | qsub -l nodes=mobs-child01,walltime=24:00:00
    >> >
    >> > the file test.txt was correctly written with contents "xyz" but
    >> > my job is still listed in "qstat" with a "Q" state as if it was
    >> > never written. I did not receive any STDIN file as well.
    >>
    >> Again, this is because the file creation is happening when you
    >> run the command.  Nothing to do with the queuing system.
    >>
    >> Unfortunately, that also means nothing here is relevant to
    >> addressing whatever problems you're having with TORQUE.  But from
    >> the sounds of it, your jobs are overloading the nodes.  Try a
    >> sleep command instead (for, say, 86400 seconds).
    >>
    >> Michael
    >>
    >> --
    >> Michael Jennings <mej at lbl.gov>
    >> Senior HPC Systems Engineer
    >> High-Performance Computing Services
    >> Lawrence Berkeley National Laboratory
    >> Bldg 50B-3209E        W: 510-495-2687
    >> MS 050B-3209          F: 510-486-8615
    >> _______________________________________________
    >> torqueusers mailing list
    >> torqueusers at supercluster.org
    >> http://www.supercluster.org/mailman/listinfo/torqueusers
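
The redirection pitfall discussed in the quoted exchange above can be reproduced without a batch system. In this small sketch `fake_qsub` is a made-up stand-in that simply echoes back whatever script text it receives on standard input; it is not part of TORQUE, and the temporary directory is only there to keep the demo self-contained.

```shell
#!/bin/sh
# Stand-in for qsub (an assumption for this demo): echo back the script
# text that arrives on standard input instead of submitting it.
fake_qsub() { cat; }

workdir=$(mktemp -d)

# Original form: the shell redirects echo's output into the file at once,
# so nothing is left to travel down the pipe.
submitted_wrong=$(echo "abc" > "$workdir/abc.txt" | fake_qsub)

# Corrected form: the quoted command line itself is piped in as the job
# script, and the file would only be written when the job runs.
submitted_right=$(echo 'echo "abc" > abc.txt' | fake_qsub)

echo "original form submitted:  '$submitted_wrong'"
echo "corrected form submitted: '$submitted_right'"
```

The same pattern covers the suggested load test: `echo 'sleep 86400' | qsub -l nodes=mobs-child04,walltime=24:00:00` pipes a one-line script that does no I/O at all.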

Dr. Roland Fehrenbacher

Q-Leap Networks GmbH
Königstrasse 17/3
D-71139 Ehningen
Tel. : +49(0)7034/277620
Fax  : +49(0)7034/652836
EMail: rf at q-leap.de

Handelsregister Amtsgericht Stuttgart HRB 245373
St.-Nr. 56/464/05060
USt-IdNr. DE220607026
Geschäftsführer:  Dr. Roland Fehrenbacher