[torqueusers] Slave Node Issues - 10 Node Brand new Cluster - Jobs not completing
RVMoye at mdanderson.org
Wed Feb 5 15:45:07 MST 2014
While the cluster is idle, ping every compute node from the server node and check for packet loss. Assuming you find none, re-run your test scenario and then ping again. If any nodes now show packet loss, you have a network problem. One common cause, though I confess I've not seen it lately, is a NIC on one or more of the nodes (or perhaps the server) that has negotiated the wrong speed or duplex setting, such as the NIC being at half-duplex while the switch is at full-duplex. This will wreck things. You might not see the problem when the network is idle, but you will definitely notice it when the network is busy.
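A minimal sketch of the ping check described above, assuming Linux iputils `ping` (whose summary line contains "N% packet loss"). The helper names `loss_pct` and `check_nodes` are made up for illustration, and the node names and the `eth0` interface in the comments are assumptions, not details confirmed in this thread.

```shell
#!/bin/sh
# Extract the packet-loss percentage from ping's summary line (iputils format),
# e.g. "20 packets transmitted, 20 received, 0% packet loss, time 19027ms".
loss_pct() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Ping each node given as an argument and print "<node> <loss%>".
# Prints "unreachable" when ping produced no summary line at all.
check_nodes() {
    for node in "$@"; do
        loss=$(ping -c 20 -q "$node" 2>/dev/null | loss_pct)
        echo "$node ${loss:-unreachable}"
    done
}

# Example (hypothetical node names), once while idle and once under load:
#   check_nodes mobs-child01 mobs-child02
# Negotiated speed and duplex on a node (interface name is an assumption):
#   ethtool eth0 | grep -E 'Speed|Duplex'
```

Running `check_nodes` once on the idle cluster and again while the test jobs are active gives the before/after comparison suggested above.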
Roger V. Moye
Systems Analyst III
XSEDE Campus Champion
University of Texas - MD Anderson Cancer Center
Division of Quantitative Sciences
Pickens Academic Tower - FCT4.6109
From: torqueusers-bounces at supercluster.org On Behalf Of Joseph De Nicolo
Sent: Monday, February 03, 2014 4:35 PM
To: rf at q-leap.de; Torque Users Mailing List
Subject: Re: [torqueusers] Slave Node Issues - 10 Node Brand new Cluster - Jobs not completing
After adjusting the NFSD count to 64, we are still having some communication issues but they seem to be more specific now so hopefully somebody can give some insight on a more exact solution. Here are a couple of scenarios of job submissions and the current issues at hand:
During both of these tests, no other jobs or processes were running on any node as they were all free.
It seems the NFS overhead is way too high even in this simple scenario:
1. A single job process was spawned on the torque server (file system mounted locally). It simply wrote to a file and completed in under 10 seconds.
2. The same single job process was spawned on a torque MOM node (file system mounted via NFSv4). It took 10 minutes.
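The comparison above can be reproduced without the batch system by timing the same write on the server and on a MOM node. This is only a sketch: `time_write` is a hypothetical helper, the directory you pass in is whatever path the shared file system is mounted at on your cluster, and `conv=fsync` assumes a dd that supports it (GNU dd does).

```shell
#!/bin/sh
# Write size_mb MiB of zeros into dir, force the data to stable storage,
# remove the file again, and print the elapsed wall-clock seconds.
time_write() {
    dir=$1
    size_mb=$2
    f="$dir/nfs_write_test.$$"
    start=$(date +%s)
    dd if=/dev/zero of="$f" bs=1048576 count="$size_mb" conv=fsync 2>/dev/null || return 1
    end=$(date +%s)
    rm -f "$f"
    echo "$((end - start))"
}

# Example (the path is an assumption): run on the server, then on a MOM node.
#   time_write /path/to/shared/dir 17
```

If the raw write is already slow on the MOM node, the problem lies in NFS or the network rather than in TORQUE.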
Another scenario shows jobs possibly colliding with or stalling one another even though no priorities are set:
1. 10 jobs spawned on torque mom (child node 1) - they were all running concurrently and completed in a reasonable time.
2. While the original 10 jobs were running again, this time we spawned 10 more on a different torque mom (child node 2).
3. The jobs were all in the same queue, run by the same person, so no priority factor.
4. When the 10 new jobs on child 2 were spawned, the run times of the jobs on child 1 were affected and nothing completed. The processes were bouncing back and forth between "R" and "D" status in ps aux, and there was a lot of I/O wait, even though only 20 jobs were running on a 10-node cluster with 64 nfsd threads.
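To put a number on the "D" (uninterruptible sleep) observation, the process states can be counted instead of eyeballed in ps aux. A small sketch follows; `count_d_state` is a hypothetical helper that expects the process state in the first column of its input.

```shell
#!/bin/sh
# Count input lines whose first field (the process state) starts with "D",
# i.e. processes in uninterruptible sleep, usually waiting on I/O.
count_d_state() {
    awk '$1 ~ /^D/ { n++ } END { print n + 0 }'
}

# Live usage on a node, sampled every few seconds while the jobs run:
#   ps -e -o stat= -o pid= -o comm= | count_d_state
```

A count that stays high on the MOM nodes while the jobs run points at processes blocked on I/O rather than at scheduling.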
Could this be a general network issue? Does NFSv4 have to be configured in more depth, possibly to allow bigger block sizes? Any help or ideas on the matter would be much appreciated. Thanks all!
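On the block-size question: the transfer sizes an NFS client actually negotiated appear as the rsize= and wsize= mount options in /proc/mounts. The sketch below only reports them; `nfs_block_sizes` is a hypothetical helper, and it does not say what values would suit this cluster.

```shell
#!/bin/sh
# Read /proc/mounts-format lines and print
# "<mountpoint> rsize=<n> wsize=<n>" for every NFS mount found.
nfs_block_sizes() {
    awk '$3 == "nfs" || $3 == "nfs4" {
        r = "?"; w = "?"
        n = split($4, opt, ",")
        for (i = 1; i <= n; i++) {
            if (opt[i] ~ /^rsize=/) r = substr(opt[i], 7)
            if (opt[i] ~ /^wsize=/) w = substr(opt[i], 7)
        }
        print $2, "rsize=" r, "wsize=" w
    }'
}

# Usage on a MOM node:
#   nfs_block_sizes < /proc/mounts
```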
Joseph De Nicolo
Systems & Data Administrator
Center for Complex Network Research<http://www.barabasilab.com>
On Thu, Jan 30, 2014 at 12:58 PM, <rf at q-leap.de> wrote:
>>>>> "Joseph" == Joseph De Nicolo <j.denicolo at neu.edu> writes:
Joseph> Thank you everybody for all the tips. After some analysis I
Joseph> think the root of the problem is with NFS. Using iostat I
Joseph> can see some I/O wait% of average 10%. We ran a test job on
Joseph> the head node where the storage is directly attached, and
Joseph> the job had "running" status and completed in an appropriate
Joseph> amount of time.
Joseph> Running the same job on a child node resulted in the job
Joseph> being flagged as
Joseph> "D" - uninterrupted sleep. Note there were other jobs
Joseph> currently running on the cluster using up I/O. The job only
Joseph> wrote 17Mb but on the head node it took 30 seconds.. while
Joseph> the child node was still showing "D" status in "ps" after 25
Joseph> This is the first cluster I ever built. After reading up on
Joseph> NFS, I realized a default NFS server only spawns 8 nfsd
Joseph> processes to handle I/O requests and that you should raise
Joseph> this number. Do you think this is the root of the problem?
Joseph> Can anybody advise me on how to raise the number of nfsd
Joseph> spawns for a NFSv4 server on ubuntu 12.04? Also what is a
Joseph> good number for a cluster that is 10 nodes, 132 cores.
Edit /etc/default/nfs-kernel-server and adjust
$ /etc/init.d/nfs-kernel-server restart
I'd say try 32 threads to start with. Increase in steps of 32 until things
look better. Of course it might also turn out that NFS is not up to the job at
all, which would have to make you think about Lustre etc. Really
depends what your applications do.
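The name of the setting to adjust did not survive in the quoted message above. On Ubuntu's nfs-kernel-server package the nfsd thread count is normally set by RPCNFSDCOUNT in /etc/default/nfs-kernel-server, so the sketch below assumes that variable. `set_nfsd_count` is a hypothetical helper that takes the file path as an argument so it can be tried on a copy first.

```shell
#!/bin/sh
# Rewrite the RPCNFSDCOUNT line in a Debian/Ubuntu-style defaults file.
# Other lines are left untouched. The file is replaced via a temporary copy.
set_nfsd_count() {
    file=$1
    count=$2
    sed "s/^RPCNFSDCOUNT=.*/RPCNFSDCOUNT=$count/" "$file" > "$file.new" &&
        mv "$file.new" "$file"
}

# As root on the NFS server (32 is the starting value suggested above):
#   set_nfsd_count /etc/default/nfs-kernel-server 32
#   /etc/init.d/nfs-kernel-server restart
#   cat /proc/fs/nfsd/threads    # number of nfsd threads now running
```

Checking /proc/fs/nfsd/threads after the restart confirms that the new count actually took effect before the jobs are re-run.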
As an Ubuntu fan, you might find Qlustar of interest to you.
Roland Fehrenbacher, PhD
Q-Leap Networks GmbH
Tel. : +49(0)7034/277620
EMail: rf at q-leap.com
http://www.q-leap.com / http://qlustar.com
Joseph> *Joseph De Nicolo* *Systems & Data Administrator* *Center
Joseph> for Complex Network Research <http://www.barabasilab.com>*
Joseph> *Northeastern University*
Joseph> On Wed, Jan 29, 2014 at 2:43 PM, Michael Jennings
Joseph> <mej at lbl.gov> wrote:
>> On Tue, Jan 28, 2014 at 9:30 AM, Joseph De Nicolo
>> <j.denicolo at neu.edu> wrote:
>> > Thanks for all the tips on how to get to the bottom of the
>> > issues. Here
>> is just a trivial test and one of the errors we are receiving
>> with the cluster:
>> > echo "abc" > abc.txt | qsub -l
>> > nodes=mobs-child04,walltime=24:00:00
>> > The file abc.txt was correctly written;
>> Yes, because your command created it. echo "abc" > abc.txt
>> creates abc.txt immediately. The fact that the file exists has
>> nothing to do with TORQUE or whether or not your job ran.
>> What you probably intended was:
>> echo 'echo "abc" > abc.txt' | qsub -l
>> > This was reported by one of my users.. I just ran the same
>> > exact test on
>> a different node:
>> > echo "xyz" > test.txt | qsub -l
>> > nodes=mobs-child01,walltime=24:00:00
>> > the file test.txt was correctly written with contents "xyz" but
>> > my job
>> is still listed in "qstat" with a "Q" state as if it was never
>> written. I did not receive any STDIN file as well.
>> Again, this is because the file creation is happening when you
>> run the command. Nothing to do with the queuing system.
>> Unfortunately, that also means nothing here is relevant to
>> addressing whatever problems you're having with TORQUE. But from
>> the sounds of it, your jobs are overloading the nodes. Try a
>> sleep command instead (for, say, 86400 seconds).
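The difference between the two command forms discussed above can be seen without a batch system. In this sketch `fake_qsub` is a stand-in (not a TORQUE command) that only reports how many bytes of job script arrive on its standard input, and the file names under the temporary directory are illustrative.

```shell
#!/bin/sh
# Stand-in for qsub: report how many bytes of job script arrive on stdin.
fake_qsub() { wc -c | tr -d ' '; }

tmp=$(mktemp -d)

# Form from the original report: the submitting shell writes the file itself,
# and the redirect leaves nothing on the pipe for qsub to read.
first=$(echo "abc" > "$tmp/abc.txt" | fake_qsub)

# Corrected form: the quoted command line is the job script sent down the pipe;
# no file is created at submission time.
second=$(echo 'echo "abc" > abc2.txt' | fake_qsub)

echo "first=$first second=$second"

# The sleep test suggested above, using the resource list from this thread:
#   echo 'sleep 86400' | qsub -l nodes=mobs-child01,walltime=24:00:00
```

In the first form the file exists immediately and the job script is empty, which matches the behaviour reported at the start of the thread.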
>> -- Michael Jennings <mej at lbl.gov> Senior HPC Systems Engineer
>> High-Performance Computing Services Lawrence Berkeley National
>> Laboratory Bldg 50B-3209E W: 510-495-2687 MS 050B-3209 F:
>> 510-486-8615
>> _______________________________________________
>> torqueusers mailing list torqueusers at supercluster.org
Dr. Roland Fehrenbacher
Q-Leap Networks GmbH
Tel. : +49(0)7034/277620
Fax : +49(0)7034/652836
EMail: rf at q-leap.de
Handelsregister Amtsgericht Stuttgart HRB 245373
Geschäftsführer: Dr. Roland Fehrenbacher